from collections import defaultdict
from typing import Callable, Dict, List, Union

import numpy as np
import torch
import torch.nn.functional as F

from lhotse import validate
from lhotse.cut import CutSet
from lhotse.dataset.input_strategies import BatchIO, PrecomputedFeatures
from lhotse.utils import LOG_EPSILON, compute_num_frames, ifnone
from lhotse.workarounds import Hdf5MemoryIssueFix


class K2SurtDataset(torch.utils.data.Dataset):
    """
    The PyTorch Dataset for the multi-talker ASR task using the k2 library.
    We support the modeling framework known as Streaming Unmixing and Recognition
    Transducer (SURT), as described in [1] and [2], but this dataset can also be
    used for other multi-talker ASR approaches such as MT-RNNT [3] and SOT [4].
    See icefall recipe for usage: https://github.com/k2-fsa/icefall/pull/1126.

    We take a cut containing possibly overlapping speech and split the supervision
    segments into one of N channels based on their start times (N is provided), known
    as ``heuristic error assignment training`` (HEAT) [1]. The supervision segments
    in each channel are then concatenated and used as the supervision for that channel.
    If we have features for the source cuts, we can also return them for use in masking
    losses, for instance.

    [1] Lu, L., Kanda, N., Li, J., & Gong, Y. (2021). Streaming end-to-end multi-talker
    speech recognition. IEEE Signal Processing Letters, 28, 803-807.

    [2] Raj, D., Lu, L., Chen, Z., Gaur, Y., & Li, J. (2022, May). Continuous streaming
    multi-talker asr with dual-path transducers. In ICASSP 2022-2022 IEEE International
    Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7317-7321). IEEE.

    [3] Sklyar, I., Piunova, A., Zheng, X., & Liu, Y. (2022, May). Multi-turn RNN-T for
    streaming recognition of multi-party speech. In ICASSP 2022-2022 IEEE International
    Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8402-8406). IEEE.

    [4] Kanda, N., Gaur, Y., Wang, X., Meng, Z., & Yoshioka, T. (2020). Serialized output
    training for end-to-end overlapped speech recognition. arXiv preprint arXiv:2003.12687.

    .. hint:: Training mixtures can be simulated from single-speaker utterances using
        :class:`~lhotse.workflows.MeetingSimulation` workflow.

    This dataset expects to be queried with lists of cut IDs,
    for which it loads features and automatically collates/batches them.

    To use it with a PyTorch DataLoader, set ``batch_size=None``
    and provide a :class:`SimpleCutSampler` sampler.
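
    A sketch of typical usage (the ``cuts`` manifest and the sampler settings here are
    illustrative, not prescribed by this class):

    .. code-block::

        >>> from torch.utils.data import DataLoader
        >>> from lhotse.dataset import SimpleCutSampler
        >>> dataset = K2SurtDataset(num_channels=2)
        >>> sampler = SimpleCutSampler(cuts, max_duration=200.0)
        >>> dloader = DataLoader(dataset, sampler=sampler, batch_size=None)
        >>> for batch in dloader:
        ...     inputs, input_lens = batch["inputs"], batch["input_lens"]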

    Each item in this dataset is a dict of:

    .. code-block::

        {
            'inputs': float tensor with shape determined by :attr:`input_strategy`:
                      - single-channel:
                        - features: (B, T, F)
                        - audio: (B, T)
                      - multi-channel: currently not supported
            'input_lens': int tensor of shape (B,)
            'supervisions': list of lists of supervision segments, where the outer list is
                        batch, and the inner list is indexed by channel. So ``len(supervisions) == B``,
                        and ``len(supervisions[i]) == num_channels``. Note that some channels may
                        have no supervision segments.
            'text': list of lists of strings, where the outer list is batch, and the inner list
                    is indexed by channel. So ``len(text) == B``, and ``len(text[i]) == num_channels``.
                    Each element contains the text of the supervision segments in that channel,
                    joined by the :attr:`text_delimiter`. Note that some channels may have no
                    supervision segments, so the corresponding text will be an empty string.
        }

    If ``return_cuts`` is ``True``, each item will also contain a ``cuts`` field with the
    :class:`~lhotse.cut.Cut` objects used to create the batch.

    Additionally, if ``return_sources`` is ``True``, each item will contain:

    .. code-block::

        {
            'source_feats': list of list of float tensors. The outer list is batch, and the inner
                            list is number of segments in the mixture. Each element denotes
                            the features of the source cut in the mixture.
            'source_boundaries': list of list of tuples, where the outer list is batch, and the inner list
                                is number of segments in the mixture. Each element denotes
                                the start and end frame of the source cut in the mixture.
        }

    In order to return the source features and boundaries, we expect the cuts to contain
    some additional fields:

    - ``source_feats``: a float tensor representing the features of all the source segments,
        concatenated together along T dimension, in order of their start times.
    - ``source_feat_offsets``: a list of ints, where each element denotes the offset of the
        source segments in the ``source_feats`` tensor.

    See https://github.com/lhotse-speech/lhotse/discussions/1008#discussioncomment-5511746
    for example code to create these fields.
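
    For example, if a mixture contains 3 source segments with 100, 50, and 75 frames,
    then ``source_feats`` has 225 rows and ``source_feat_offsets`` would be
    ``[0, 100, 150]`` (this layout is an assumption for illustration; see the linked
    discussion for the exact recipe). The segments can then be recovered with:

    .. code-block::

        >>> pieces = np.split(source_feats, source_feat_offsets)[1:]
        >>> [len(p) for p in pieces]
        [100, 50, 75]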


    Dimension symbols legend:
    * ``B`` - batch size (number of Cuts)
    * ``S`` - number of supervision segments (greater than or equal to B, as each Cut may have multiple supervisions)
    * ``T`` - number of frames of the longest Cut
    * ``F`` - number of features

    The 'sequence_idx' field is the index of the Cut used to create the example in the Dataset.
    """

    def __init__(
        self,
        return_cuts: bool = False,
        return_sources: bool = False,
        return_alignments: bool = False,
        num_channels: int = 2,
        text_delimiter: str = " ",
        cut_transforms: List[Callable[[CutSet], CutSet]] = None,
        input_transforms: List[Callable[[torch.Tensor], torch.Tensor]] = None,
        input_strategy: BatchIO = PrecomputedFeatures(),
        pad_value: float = LOG_EPSILON,
        strict: bool = True,
    ):
        """
        K2SurtDataset constructor.

        :param return_cuts: When ``True``, will additionally return a "cut" field in each batch with the Cut
            objects used to create that batch.
        :param return_sources: When ``True``, will additionally return a "source_feats" field and a "source_boundaries"
            field in each batch. The "source_feats" field contains the features of the source cuts from
            which the mixture was created, and "source_boundaries" contains the boundaries of the source cuts
            in the mixture (in number of frames). This requires that the cuts contain additional fields
            ``source_feats`` (which is a TemporalArray) and ``source_feat_offsets`` (which is a list of ints).
        :param return_alignments: When ``True``, will keep the supervision alignments if they
            are present in the cuts.
        :param num_channels: Number of output branches. The supervision utterances will be
            split into the channels based on their start times.
        :param text_delimiter: The delimiter used to join the text of the supervision segments in
            each channel.
        :param cut_transforms: A list of transforms to be applied on each sampled batch,
            before converting cuts to an input representation (audio/features).
            Examples: cut concatenation, noise cuts mixing, etc.
        :param input_transforms: A list of transforms to be applied on each sampled batch,
            after the cuts are converted to audio/features.
            Examples: normalization, SpecAugment, etc.
        :param input_strategy: The strategy used to convert the cuts to audio/features.
        :param pad_value: The value used to pad the source features to resolve off-by-one errors.
        :param strict: If ``True``, we will remove cuts that have more simultaneous supervisions
            than the number of channels. If ``False``, we will keep them.
        """
        super().__init__()
        self.return_cuts = return_cuts
        self.return_sources = return_sources
        self.return_alignments = return_alignments
        self.num_channels = num_channels
        self.text_delimiter = text_delimiter
        self.cut_transforms = ifnone(cut_transforms, [])
        self.input_transforms = ifnone(input_transforms, [])
        self.input_strategy = input_strategy
        self.pad_value = pad_value
        self.strict = strict

        # Workaround for HDF5-related memory issues when reading feature archives.
        self.hdf5_fix = Hdf5MemoryIssueFix(reset_interval=100)

    def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List]]:
        """
        Return a new batch, with the batch size automatically determined using the constraints
        of max_duration and max_cuts.
        """
        validate_for_asr(cuts)
        self.hdf5_fix.update()

        if not self.return_alignments:
            cuts = cuts.drop_alignments()

        # Sort the cuts by duration so that the longest cut determines the batch dimensions.
        cuts = cuts.sort_by_duration(ascending=False)

        # Optional CutSet transforms, e.g. concatenation or noise mixing.
        for tnfm in self.cut_transforms:
            cuts = tnfm(cuts)

        supervisions = defaultdict(list)
        source_feats, source_boundaries = [], []
        invalid_cuts = []

        for cut in cuts:
            # HEAT-style assignment: each supervision goes to the first channel
            # that is free at its start time.
            sups = [[] for _ in range(self.num_channels)]
            last_sup_end = [0.0 for _ in range(self.num_channels)]
            cut_sources = []
            cut_source_boundaries = []
            invalid_cut = False
            for sup in sorted(cut.supervisions, key=lambda s: s.start):
                assigned = False
                for i in range(self.num_channels):
                    if len(sups[i]) == 0 or last_sup_end[i] <= sup.start:
                        sups[i].append(sup)
                        last_sup_end[i] = max(last_sup_end[i], sup.end)
                        assigned = True
                        break
                if not assigned:
                    # More simultaneous speakers than channels. In strict mode the
                    # cut will be discarded; otherwise, assign the supervision to
                    # the channel that becomes free the earliest.
                    invalid_cut = True
                    i = last_sup_end.index(min(last_sup_end))
                    sups[i].append(sup)
                    last_sup_end[i] = max(last_sup_end[i], sup.end)

            if self.return_sources:
                source_feat_offsets = cut.source_feat_offsets
                assert len(source_feat_offsets) == len(cut.supervisions), (
                    "The number of source feature offsets should be equal to the number "
                    f"of supervisions. Got {len(source_feat_offsets)} offsets for "
                    f"{len(cut.supervisions)} supervisions."
                )
                # The offsets hold the start frame of each source segment (the first
                # offset is 0), so the first piece of the split is empty and dropped.
                cut_sources = [
                    torch.from_numpy(x)
                    for x in np.split(cut.load_source_feats(), source_feat_offsets)[1:]
                ]
                cut_source_boundaries = [
                    (
                        compute_num_frames(sup.start, cut.frame_shift, cut.sampling_rate),
                        compute_num_frames(sup.end, cut.frame_shift, cut.sampling_rate),
                    )
                    for sup in sorted(cut.supervisions, key=lambda s: (s.start, s.speaker))
                ]
                # Fix possible off-by-one mismatches between the source features
                # and the boundaries computed from the supervision times.
                cut_sources = [
                    adjust_source_feats(x, end - start, padding_value=self.pad_value)
                    for x, (start, end) in zip(cut_sources, cut_source_boundaries)
                ]

            if invalid_cut and self.strict:
                invalid_cuts.append(cut.id)
                continue

            supervisions[cut.id] = sups
            if self.return_sources:
                source_feats.append(cut_sources)
                source_boundaries.append(cut_source_boundaries)

        if len(invalid_cuts) > 0:
            print(
                f"WARNING: {len(invalid_cuts)} cuts were removed out of {len(cuts)} "
                "due to more overlapping speakers than channels."
            )
            cuts = cuts.filter(lambda c: c.id not in invalid_cuts)

        input_tpl = self.input_strategy(cuts)
        if len(input_tpl) == 3:
            # An input strategy with fault-tolerant audio reading mode also
            # returns the (possibly filtered) cuts.
            inputs, input_lens, cuts = input_tpl
        else:
            inputs, input_lens = input_tpl

        # Optional input transforms, e.g. normalization or SpecAugment.
        for tnfm in self.input_transforms:
            inputs = tnfm(inputs)

        batch = {
            "inputs": inputs,
            "input_lens": input_lens,
            "supervisions": list(supervisions.values()),
            "text": [
                [
                    self.text_delimiter.join(sup.text.strip() for sup in sups_ch)
                    for sups_ch in cut_sups
                ]
                for cut_sups in supervisions.values()
            ],
        }
        if self.return_cuts:
            batch["cuts"] = cuts
        if self.return_sources:
            batch["source_feats"] = source_feats
            batch["source_boundaries"] = source_boundaries
        return batch
zK2SurtDataset.__getitem__)__name__
__module____qualname____doc__r
   r   boolintstrr   r   r   r5   Tensorr	   floatr   r   r   ro   __classcell__r$   r$   r"   r%   r      sD    b	
0:r   r.   r   c                 C   s   | j d |kr	| S t| j d | |kr"td| j d  d| d| j d |k r:tj| ddd|| j d  f|dS | d| S )aU  
    Adjust the number of frames in the source features to match the supervision.
    If the source features have fewer frames than the supervision, we pad them
    to match the supervision. If the source features have more frames than the
    supervision, we trim them to match the supervision.

    Args:
        feats: Source features.
        num_frames: Number of frames in the supervision.
        padding_value: Value to use for padding.
        tol: Tolerance for checking if the number of frames in the source features
            is close to the number of frames in the supervision.
    r   z)Number of frames in the source features (z;) is not close to the number of frames in the supervision (z).)valueN)shapeabs
    """
    if feats.shape[0] == num_frames:
        return feats
    if abs(feats.shape[0] - num_frames) > tol:
        raise ValueError(
            f"Number of frames in the source features ({feats.shape[0]}) is not close "
            f"to the number of frames in the supervision ({num_frames})."
        )
    if feats.shape[0] < num_frames:
        # Pad at the end of the time axis.
        return F.pad(
            feats, (0, 0, 0, num_frames - feats.shape[0]), value=padding_value
        )
    # Trim extra frames at the end of the time axis.
    return feats[:num_frames]


def validate_for_asr(cuts: CutSet) -> None:
    validate(cuts)
    tol = 2e-3  # 1ms jitter can be caused by speed perturbation
    for cut in cuts:
        for supervision in cut.supervisions:
            assert supervision.start >= -tol, (
                "Supervisions starting before the cut are not supported for ASR "
                f"(sup id: {supervision.id}, cut id: {cut.id})"
            )
            assert supervision.duration <= cut.duration + tol, (
                "Supervisions ending after the cut are not supported for ASR "
                f"(sup id: {supervision.id}, cut id: {cut.id})"
            )
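

# The following is a minimal, self-contained sketch (illustration only, not part of
# the lhotse API) of the HEAT channel-assignment heuristic that
# ``K2SurtDataset.__getitem__`` applies to supervisions: each segment goes to the
# first channel that is free at its start time. It operates on plain (start, end)
# tuples so it can be run standalone, without any manifests or features.
if __name__ == "__main__":

    def heat_assign(segments, num_channels):
        """Greedily assign (start, end) segments to channels by start time."""
        channels = [[] for _ in range(num_channels)]
        last_end = [0.0] * num_channels
        for start, end in sorted(segments):
            for i in range(num_channels):
                # A channel is free if it has no segments yet, or its last
                # segment ended at or before this segment's start.
                if not channels[i] or last_end[i] <= start:
                    channels[i].append((start, end))
                    last_end[i] = max(last_end[i], end)
                    break
        return channels

    # Three segments, two channels: the overlapping middle segment is routed to
    # channel 1, while the first and third share channel 0.
    print(heat_assign([(0.0, 3.0), (2.0, 5.0), (4.0, 6.0)], num_channels=2))
    # [[(0.0, 3.0), (4.0, 6.0)], [(2.0, 5.0)]]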