o }o™iÍ<ã@sddlZddlZddlZddlmZddlmZmZmZm Z m Z mZmZddl ZddlZddlmZmZddlmZddlmZmZGdd„dejƒZGd d „d eƒZGdd„deƒZGd d„deƒZGdd„deƒZGdd„deƒZGdd„deƒZ Gdd„de!ƒZ"Gdd„deƒZ#Gdd„de!ƒZ$Gdd„de$ƒZ%Gdd„deƒZ&Gdd „d e&ƒZ'Gd!d"„d"eƒZ(Gd#d$„d$e(ƒZ)Gd%d&„d&eƒZ*Gd'd(„d(e*ƒZ+Gd)d*„d*eƒZ,Gd+d,„d,e,ƒZ-Gd-d.„d.eƒZ.Gd/d0„d0e.ƒZ/Gd1d2„d2eƒZ0Gd3d4„d4e0ƒZ1Gd5d6„d6eƒZ2Gd7d8„d8e2ƒZ3dS)9éN)Úcombinations)ÚAnyÚCallableÚDictÚIterableÚListÚOptionalÚUnion)ÚmanifestÚparsers)Ú get_full_path)ÚloggingÚlogging_modec@seZdZdZdZdS)Ú_Collectionz%List of parsed and preprocessed data.N)Ú__name__Ú __module__Ú__qualname__Ú__doc__ÚOUTPUT_TYPE©rrúk/home/ubuntu/.local/lib/python3.10/site-packages/nemo/collections/common/parts/preprocessing/collections.pyrsrcs<eZdZdZe dd¡Zdeede j f‡fdd„Z‡ZS)ÚTextzCSimple list of preprocessed text entries, result in list of tokens.Ú TextEntityÚtokensÚtextsÚparsercsRg|j}}|D]}||ƒ}|durt d|¡q| ||ƒ¡qtƒ |¡dS)zÉInstantiates text manifest and do the preprocessing step. Args: texts: List of raw texts strings. parser: Instance of `CharParser` to convert string to tokens. NzFail to parse '%s' text line.)rr ÚwarningÚappendÚsuperÚ__init__)ÚselfrrÚdataÚoutput_typeÚtextr©Ú __class__rrr(sz Text.__init__) rrrrÚcollectionsÚ namedtuplerrÚstrrÚ CharParserrÚ __classcell__rrr$rr#s$rcsFeZdZdZdedejf‡fdd„Zedede efdd„ƒZ ‡ZS) ÚFromFileTextz6Another form of texts manifest with reading from file.Úfilercs| |¡}tƒ ||¡dS)zÅInstantiates text manifest and do the preprocessing step. Args: file: File path to read from. parser: Instance of `CharParser` to convert string to tokens. N)Ú_FromFileText__parse_textsrr)r r,rrr$rrr@s zFromFileText.__init__ÚreturncCsžtj |¡s tdƒ‚tj |¡\}}|dkr!t |¡d ¡}|S|dkr3tdd„t |¡Dƒƒ}|St|dƒ }| ¡}Wdƒ|S1sHwY|S)Nz$Provided texts file does not exists!z.csvÚ transcriptz.jsoncss|]}|dVqdS)r#Nr)Ú.0ÚitemrrrÚ Us€z-FromFileText.__parse_texts..Úr) ÚosÚpathÚexistsÚ ValueErrorÚsplitextÚpdÚread_csvÚtolistÚlistr Ú item_iterÚopenÚ readlines)r,Ú_ÚextrÚfrrrÚ __parse_textsLsúý ÿýzFromFileText.__parse_texts)rrrrr(rr)rÚstaticmethodrr-r*rrr$rr+=s r+cóºeZdZdZejdddZ ddeedee d ee d ee dee deeed eeedeeedeee dej dee dee deededef‡fdd„ Z‡ZS)Ú AudioTextú@List of audio-transcript text correspondence with preprocessing.ÚAudioTextEntityzGid audio_file duration text_tokens offset text_raw speaker orig_sr lang©ÚtypenameÚfield_namesNFÚidsÚaudio_filesÚ durationsrÚoffsetsÚspeakersÚorig_sampling_ratesÚtoken_labelsÚlangsrÚmin_durationÚmax_durationÚ max_numberÚdo_sort_by_durationÚindex_by_file_idc! s"|j}d}gdddf\}}}}|ri|_t||||||||| ƒ D]º\ }}}}}}}}}|dur1d}|durF|durF||krF||7}|d7}q |dur[|dur[||kr[||7}|d7}q |durb|}n3|dkr†t| dƒr| jrt|tƒr|dur}| ||ƒ}ntd ƒ‚| |ƒ}ng}|dur•||7}|d7}q ||durœ|nd7}| ||||||||||ƒ ¡|rÒt j t j |¡¡\}} ||jvrÆg|j|<|j| t |ƒd¡t |ƒ| krÚnq |rí|råt d ¡n|jdd„d t dt |ƒ|d¡t d||d¡|s t d¡tƒ |¡dS)a Instantiates audio-text manifest with filters and preprocessing. Args: ids: List of examples positions. audio_files: List of audio files. durations: List of float durations. texts: List of raw text transcripts. offsets: List of duration offsets or None. speakers: List of optional speakers ids. orig_sampling_rates: List of original sampling rates of audio files. langs: List of language ids, one for eadh sample, or None. parser: Instance of `CharParser` to convert string to tokens. min_duration: Minimum duration to keep entry with (default: None). max_duration: Maximum duration to keep entry with (default: None). max_number: Maximum number of samples to collect. do_sort_by_duration: True if sort samples list by duration. Not compatible with index_by_file_id. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. TçrNFéÚÚis_aggregateú9lang required in manifest when using aggregate tokenizersúLTried to sort dataset by duration, but cannot since index_by_file_id is set.cSó|jS©N©Úduration©ÚentityrrrÚÇóz$AudioText.__init__..©Úkeyú1Dataset loaded with %d files totalling %.2f hourséú+%d files were filtered totalling %.2f hourszRNot all audios have duration information, the total number of hours is inaccurate.©rÚmappingÚzipÚhasattrr\Ú isinstancer(r7rr4r5r8ÚbasenameÚlenr rÚsortÚinforr)!r rLrMrNrrOrPrQrRrSrrTrUrVrWrXr"Úall_has_durationr!Úduration_filteredÚnum_filteredÚtotal_durationÚid_Ú audio_filerbÚoffsetr#ÚspeakerÚorig_srÚlangÚtext_tokensÚfile_idr@r$rrresd%ÿ ÿ zAudioText.__init__©NNNFF©rrrrr&r'rrÚintr(Úfloatrrr)Úboolrr*rrr$rrF]óTþðþýüûú ù ø ÷ öõô óòñðrFcrE)Ú VideoTextz@List of video-transcript text correspondence with preprocessing.rHzGid video_file duration text_tokens offset text_raw speaker orig_sr langrINFrLÚvideo_filesrNrrOrPrQrRrSrrTrUrVrWrXc sæ|j}gdddf\}}}}|ri|_t||||||||| ƒ D]¦\ }}}}}}}}}|dur:||kr:||7}|d7}q|durK||krK||7}|d7}q|durR|}n3|dkrvt| dƒrq| jrqt|tƒrq|durm| ||ƒ}ntdƒ‚| |ƒ}ng}|dur…||7}|d7}q||7}| ||||||||||ƒ ¡|r¼t j t j |¡¡\}}||jvr°g|j|<|j| t |ƒd¡t |ƒ| krÄnq|r×|rÏt d¡n|jd d „dt dt |ƒ|d ¡t d||d ¡tƒ |¡dS)a Instantiates video-text manifest with filters and preprocessing. Args: ids: List of examples positions. video_files: List of video files. durations: List of float durations. texts: List of raw text transcripts. offsets: List of duration offsets or None. speakers: List of optional speakers ids. orig_sampling_rates: List of original sampling rates of audio files. langs: List of language ids, one for eadh sample, or None. parser: Instance of `CharParser` to convert string to tokens. min_duration: Minimum duration to keep entry with (default: None). max_duration: Maximum duration to keep entry with (default: None). max_number: Maximum number of samples to collect. do_sort_by_duration: True if sort samples list by duration. Not compatible with index_by_file_id. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. rYrNrZr[r\r]r^cSr_r`rarcrrrre4rfz$VideoText.__init__..rgrirjrkrl) r rLrˆrNrrOrPrQrRrSrrTrUrVrWrXr"r!rvrwrxryÚ video_filerbr{r#r|r}r~rr€r@r$rrrØsZ%ÿ ÿzVideoText.__init__rr‚rrr$rr‡Ðr†r‡csŠeZdZdZejdddZ ddeee efde ed e ed e ede ede d e de de f‡fdd„ Zdd„Z‡ZS)ÚInstructionTuningAudioTextú5`AudioText` collector from asr structured json files.ÚInstructionTuningTextzjid context context_type context_duration question question_type answer answer_type answer_duration speakerrINFÚmanifests_filesrTrUÚmax_seq_lengthrVrWrXÚdecoder_only_modelÚuse_phoneme_tokenizerc !sX|j} | |_gdddf\}}} }|ri|_t |¡D]á}|d}|d}|d}|d}|d}|d}|d }|d }|d}|d}|d }|durNdn|}|dkrV|n|}|duri||kri||7}| d7} q|durz||krz||7}| d7} qt| |||¡ddƒ}| ||d¡}| |||¡}|rž||||ks¨|||ks¨||kr±||7}| d7} q||7}| | ||||||||||ƒ ¡|rótj tj |¡¡\}} d|vrÝ|dd…}||jvrçg|j|<|j| t|ƒd¡t|ƒ|krûnq|r|rt d¡n|jdd„dt dt|ƒ|d¡t d| |d¡tƒ |¡dS)aHParse lists of audio files, durations and transcripts texts. Args: manifests_files: Either single string file or list of such - manifests to yield items from. *args: Args to pass to `AudioText` constructor. **kwargs: Kwargs to pass to `AudioText` constructor. rYrÚidÚcontextÚcontext_durationÚcontext_typeÚquestionÚ question_typer|ÚanswerÚanswer_durationÚanswer_typeÚtaskNÚttsrZg333333Ó?iz.contextiøÿÿÿr^cSr_r`rarcrrrre§rfz5InstructionTuningAudioText.__init__..rgrirjrk)rrrmr r=ÚminÚ_get_lenrr4r5r8rqrrr rrsrtrr)!r rrTrUrŽrVrWrXrrr"r!rvrwrxr1r‘r’r“r”r•r–r|r—r˜r™ršrbÚapprox_context_lenÚapprox_question_lenÚapprox_answer_lenr€r@r$rrrGsŠÿöÿ ÿz#InstructionTuningAudioText.__init__cCs\|dkr|dS|dkr|jrt|ƒSt| d¡ƒdS|dkr&t|ƒdStd|›dƒ‚) NÚSPEECHéLÚTEXTú éÚTOKENSzUnknown field type Ú.)rrrÚsplitr7)r Ú field_typer!Ú duration_datarrrr®sz#InstructionTuningAudioText._get_len)NNNNFFFF)rrrrr&r'rr r(rrr„rƒr…rrr*rrr$rrŠ<sDýöþýüûúùø ÷ ögrŠcs<eZdZdZddeeeefdeef‡fdd„ Z ‡Z S)ÚASRAudioTextr‹NrÚ parse_funcc sìgggggf\}}}}} ggggf\} }}} tj||dD]A}| |d¡| |d¡| |d¡| |d¡| |d¡| |d¡| |d¡| |d ¡| |d ¡qtƒj||||| | ||| g |¢Ri|¤ŽdS)áIParse lists of audio files, durations and transcripts texts. Args: manifests_files: Either single string file or list of such - manifests to yield items from. *args: Args to pass to `AudioText` constructor. **kwargs: Kwargs to pass to `AudioText` constructor. ©r¬r‘rzrbr#r{r|r}rRr~N©r r=rrr)r rr¬ÚargsÚkwargsrLrMrNrrOrPÚorig_srsrRrSr1r$rrrÀs<ûú ÿÿ ÿzASRAudioText.__init__r`)rrrrr r(rrrrr*rrr$rr«½s0r«c@seZdZdZddd„ZdS)ÚSpeechLLMAudioTextEntityz(Class for SpeechLLM dataloader instance.r.Nc Cs:||_||_||_||_||_||_||_||_| |_dS)zCInitialize the AudioTextEntity for a SpeechLLM dataloader instance.N) r‘rzrbr’r—r{r|r}r~) r Úsidrzrbr’r—r{r|r}r~rrrrës z!SpeechLLMAudioTextEntity.__init__)r.N)rrrrrrrrrr³èsr³có2eZdZdZdeeeeff‡fdd„Z‡ZS)ÚASRVideoTextz4`VideoText` collector from cv structured json files.rc sègggggf\}}}}}ggggf\} } }}t |¡D]A} | | d¡| | d¡| | d¡| | d¡| | d¡| | d¡| | d¡| | d¡| | d ¡qtƒj|||||| | ||g |¢Ri|¤Žd S)aIParse lists of video files, durations and transcripts texts. Args: manifests_files: Either single string file or list of such - manifests to yield items from. *args: Args to pass to `VideoText` constructor. **kwargs: Kwargs to pass to `VideoText` constructor. r‘r‰rbr#r{r|r}rRr~Nr¯)r rr°r±rLrˆrNrrOrPr²rRrSr1r$rrrûs<ûúÿÿ ÿzASRVideoText.__init__© rrrrr r(rrr*rrr$rr¶øó&r¶c @s´eZdZdZ ddeedeedeedeedeed eed eeedeeedeeed eedeedeede de deefdd„Z dd„Zdd„ZdS)ÚSpeechLLMAudioTextzÁList of audio-transcript text correspondence with preprocessing. All of the audio, duration, context, answer are optional. If answer is not present, text is treated as the answer. NFrLrMrNÚcontext_listÚanswersrOrPrQrSrTrUrVrWrXÚmax_num_samplesc# sžgdddf\‰}}}|ri|_t||||||||| ƒ D]¢\ }}}}}}}}}|durqt|tƒr3t|ƒn|}t|tƒr>t|ƒn|}t|tƒrIt|ƒn|}| dur\|| kr\||7}|d7}q|durm||krm||7}|d7}q||7}|dur~||7}|d7}qˆ t|||||||||ƒ ¡|rµ|durµt j t j |¡¡\}} ||jvr©g|j|<|j| t ˆƒd¡t ˆƒ|kr½nq|dur|s|t ˆƒkrát dt ˆƒ›d|›d¡ˆd|…‰nAt dt ˆƒ›d|›d¡ˆ|t ˆƒ‰|t ˆƒ}!‡fd d „tjjt ˆƒ|!ddDƒ}"ˆ |"¡n |dur"|r"t d ¡| r6|r.t d¡nˆjdd„dt dt ˆƒ|d¡t d||d¡ˆ|_dS)aInstantiates audio-context-answer manifest with filters and preprocessing. Args: ids: List of examples positions. audio_files: List of audio files. durations: List of float durations. context_list: List of raw text transcripts. answers: List of raw text transcripts. offsets: List of duration offsets or None. speakers: List of optional speakers ids. orig_sampling_rates: List of original sampling rates of audio files. langs: List of language ids, one for eadh sample, or None. min_duration: Minimum duration to keep entry with (default: None). max_duration: Maximum duration to keep entry with (default: None). max_number: Maximum number of samples to collect. do_sort_by_duration: True if sort samples list by duration. Not compatible with index_by_file_id. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. rYrNrZzSubsampling dataset from z to z sampleszOversampling dataset from csg|]}ˆ|‘qSrr)r0Úidx©r!rrÚ óz/SpeechLLMAudioText.__init__..F)ÚreplacezXTried to subsample dataset by max_num_samples, but cannot since index_by_file_id is set.r^cSr_r`rarcrrrreˆrfz-SpeechLLMAudioText.__init__..rgrirjrk)rmrnrpr<rœÚmaxÚsumrr³r4r5r8rqrrr rtÚnpÚrandomÚchoiceÚextendrrsr!)#r rLrMrNrºr»rOrPrQrSrTrUrVrWrXr¼rvrwrxryrzrbr{r’r—r|r}r~Úcurr_min_durÚcurr_max_durÚcurr_sum_durr€r@Úres_numÚres_datarr¾rr)sh%ÿÿ ÿ$ zSpeechLLMAudioText.__init__cCs<|dks|t|jƒkrtdt|jƒ›d|›dƒ‚|j|S)Nrzindex out of range [0,z), got z instead)rrr!r7)r r½rrrÚ__getitem__s zSpeechLLMAudioText.__getitem__cCs t|jƒSr`)rrr!)r rrrÚ__len__”s zSpeechLLMAudioText.__len__)NNNFFN) rrrrrrƒr(r„rr…rrÍrÎrrrrr¹"sRðþýüûúù ø ÷ öõô óòñ ðfr¹c steZdZdZ ddeeeefdeeeeefdedef‡fd d „ Zdeded e ee ffdd„Z‡ZS)ÚSpeechLLMAudioTextCollectionz`SpeechLLMAudioText` collector from SpeechLLM json files. This collector also keeps backward compatibility with SpeechLLMAudioText. Nr’r—rÚcontext_fileÚcontext_keyÚ answer_keyc s˜||_||_ggggggf\}}} } }}ggg} }}|durht|tƒr*| d¡n|}g|_|D]*}t|dƒ}| ¡D]}| ¡}|rK|j |¡q=Wdƒn1sVwYq1t d|›d|›¡nd|_tj ||jdD]A}| |d¡| |d¡| |d ¡| |d ¡| |d¡| |d¡| |d ¡| |d¡| |d¡qstƒj||| | ||| ||g |¢Ri|¤ŽdS)rNú,r3zUse random text context from z for r®r‘rzrbr’r—r{r|r}r~)rÑrÒrpr(r¨rºr>r?Ústriprr rtr r=Ú)_SpeechLLMAudioTextCollection__parse_itemrr)r rrÐrÑrÒr°r±rLrMrNrºr»rOrPr²rSÚquestion_file_listÚfilepathrBÚliner1r$rrržsh úù ý€ýÿ€ÿÿ ÿz%SpeechLLMAudioTextCollection.__init__rØÚ manifest_filer.cCst |¡}d|vr| d¡|d<nd|vr| d¡|d<nd|vr%d|d<|ddur6tj|d|d|d<d|vr>d|d<|j|vrL| |j¡|d<n3d|vrX| d¡|d<n'd|vr{t| d¡d ƒ}| ¡|d<Wdƒn1suwYnd |d<|j|vr| |j¡|d<nUd|vr°t| d¡d ƒ}| ¡|d<Wdƒn1sªwYn2|j durÃt j |j ¡ ¡}||d<nd |vrÞtjd|j›d|›tjd| d ¡|d<nd|d<t|d|dt|dƒt|dƒ| dd¡| dd¡| dd¡| dd¡d}|S)NÚaudio_filenamerzÚaudio_filepath©rzrÙrbr—r#Ú text_filepathr3Únar’Úcontext_filepathr•z Neither `zC` is found nor`context_file` is set, but found `question` in item: )Úmodezwhat does this audio meanr{r|Úorig_sample_rater~)rzrbr’r—r{r|r}r~)ÚjsonÚloadsÚpopr rrÒr>ÚreadrÑrºrÄrÅrÆrÔr rrÚONCEÚdictr(Úget)r rØrÙr1rBr’rrrÚ__parse_itemásf ÿ€ ÿ€ ÿý ø z)SpeechLLMAudioTextCollection.__parse_item)Nr’r—) rrrrr r(rrrrrrÕr*rrr$rrÏ˜sûþýüû&CrÏcsˆeZdZdZejdddZ ddeedee d ee eefd eee dee dee d eede de f‡fdd„ Z‡ZS)ÚSpeechLabelz6List of audio-label correspondence with preprocessing.ÚSpeechLabelEntityz audio_file duration label offsetrINFrMrNÚlabelsrOrTrUrVrWrXc s¦| ri|_|j} gd}}d} d}t||||ƒD][\}}}}|dur/|dur/||kr/||7}q|dur@|dur@||kr@||7}q| | ||||ƒ¡|durT| |7} d}| rktj tj |¡¡\}}t|ƒd|j|<t|ƒ|krsnq|r†| r~t d¡n|jdd„d |r”t d t|ƒ›d¡nt d|d d›d¡t dt|ƒ›d| d d›d¡t ttdd„|ƒƒƒ|_t d t|ƒt|jƒ¡¡tƒ |¡dS)aŽInstantiates audio-label manifest with filters and preprocessing. Args: audio_files: List of audio files. durations: List of float durations. labels: List of labels. offsets: List of offsets or None. min_duration: Minimum duration to keep entry with (default: None). max_duration: Maximum duration to keep entry with (default: None). max_number: Maximum number of samples to collect. do_sort_by_duration: True if sort samples list by duration. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. rYTNFrZr^cSr_r`rarcrrrrejrfz&SpeechLabel.__init__..rgúDataset loaded with z( items. The durations were not provided.ú,Filtered duration for loading collection is rjú .2fú hours.z!Dataset successfully loaded with z4 items and total duration provided from manifest is cSr_r`)Úlabel)Úxrrrreurfz+# {} files loaded accounting to # {} labels)rmrrnrr4r5r8rqrrr rrsrtÚsortedÚsetÚmapÚuniq_labelsÚformatrr)r rMrNrìrOrTrUrVrWrXr"r!rvrxÚduration_undefinedrzrbÚcommandr{r€r@r$rrr-sL ÿÿÿzSpeechLabel.__init__r)rrrrr&r'rrr(r„r rƒrr…rr*rrr$rrê%s<þöþýü ûúùø ÷ örêcsXeZdZdZ ddeeeeff‡fdd„ Zdeded eee ffd d„Z ‡ZS) ÚASRSpeechLabelz3`SpeechLabel` collector from structured json files.FNrcsÖggggf\}}} } g}tj||jdD]<}| |d¡| |d¡|s6|d} |s0| ¡n| |¡}n t|dƒ} | g}| | ¡| |d¡| |¡q|rYt |¡|_ t ƒj||| | g|¢Ri|¤ŽdS)aParse lists of audio files, durations and transcripts texts. Args: manifests_files: Either single string file or list of such - manifests to yield items from. is_regression_task: It's a regression task. cal_labels_occurrence: whether to calculate occurence of labels. delimiter: separator for labels strings. *args: Args to pass to `SpeechLabel` constructor. **kwargs: Kwargs to pass to `SpeechLabel` constructor. r®rzrbrñr{N)r r=Ú_ASRSpeechLabel__parse_itemrr¨r„rÇr&ÚCounterÚlabels_occurrencerr)r rÚis_regression_taskÚcal_labels_occurrenceÚ delimiterr°r±rMrNrìrOÚ all_labelsr1rñÚ label_listr$rrr~s $zASRSpeechLabel.__init__rØrÙr.cCsêt |¡}d|vr| d¡|d<nd|vr| d¡|d<ntd|›dƒ‚tj|d|d|d<d|vr [0 1 2 1 0 0 2] In this seq of label , if label do not appear before, assign new relative labels len(pos); else reuse previous assigned relative labels. Args: seq_label (str): A string of a sequence of labels. Return: relative_seq_label (List) : A list of relative sequence of labels unique_labels_in_seq (Set): A set of unique labels in the sequence )r¨rçrrrrôÚkeys)r rÚseqÚconversion_dictÚrelative_seq_labelÚsegÚ convertedÚunique_labels_in_seqrrrr s z,FeatureSequenceLabel.relative_speaker_parser©NF)rrrrr&r'rrr(rrƒr…rr r*rrr$rr Ês$þ ûþýüû0r csbeZdZdZ ddeeeefdeede f‡fdd„ Z d ed edeeeffdd „Z ‡ZS)ÚASRFeatureSequenceLabelz@`FeatureSequenceLabel` collector from asr structured json files.NFrrVrXcsRgg}}tj||jdD]}| |d¡| |d¡q tƒ ||||¡dS)aêParse lists of feature files and sequences of labels. Args: manifests_files: Either single string file or list of such manifests to yield items from. max_number: Maximum number of samples to collect; pass to `FeatureSequenceLabel` constructor. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data; pass to `FeatureSequenceLabel` constructor. r®rrN)r r=Ú_parse_itemrrr)r rrVrXrrr1r$rrr#s z ASRFeatureSequenceLabel.__init__rØrÙr.cCsžt |¡}d|vr| d¡|d<nd|vr| d¡|d<ntd|›dƒ‚tj |d¡|d<d|vr;| d¡|d<ntd|›dƒ‚t|d|dd}|S) NÚfeature_filenamerÚfeature_filepathrz! without proper feature file key.rz without proper seq_label key.)rr)rârãrär7r4r5Ú expanduserrçrrrrr<s& ÿ ÿþz#ASRFeatureSequenceLabel._parse_itemr)rrrrr r(rrrƒr…rrrrr*rrr$rr süþýü&rcsˆeZdZdZejdddZ ddeedee d eed ee dee deed ee dee dee dedef‡fdd„ Z‡ZS)ÚDiarizationLabelzBList of diarization audio-label correspondence with preprocessing.ÚDiarizationLabelEntityz^audio_file duration rttm_file offset target_spks sess_spk_dict clus_spk_digits rttm_spk_digitsrINFrMrNÚ rttm_filesrOÚtarget_spks_listÚsess_spk_dictsÚ clus_spk_listÚ rttm_spk_listrVrWrXcs|ri|_|j}gd} }t||||||||ƒ}|D]=\}}}}}}}}|dur*d}| |||||||||ƒ¡|rOtj tj |¡¡\}}t| ƒd|j|<t| ƒ| krWnq| rj|rbt d¡n| jdd„dt d |¡t d t| ƒ›dt|ƒ›d¡t ƒ | ¡dS) aFInstantiates audio-label manifest with filters and preprocessing. Args: audio_files: List of audio file paths. durations: List of float durations. rttm_files: List of RTTM files (Groundtruth diarization annotation file). offsets: List of offsets or None. target_spks (tuple): List of tuples containing the two indices of targeted speakers for evaluation. Example: [[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], [(0, 1), (1, 2), (0, 2)], ...] sess_spk_dict (Dict): List of Mapping dictionaries between RTTM speakers and speaker labels in the clustering result. clus_spk_digits (tuple): List of Tuple containing all the speaker indices from the clustering result. Example: [(0, 1, 2, 3), (0, 1, 2), ...] rttm_spkr_digits (tuple): List of tuple containing all the speaker indices in the RTTM file. Example: (0, 1, 2), (0, 1), ...] max_number: Maximum number of samples to collect do_sort_by_duration: True if sort samples list by duration index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. rYNrrZr^cSr_r`rarcrrrreºrfz+DiarizationLabel.__init__..rgú/Filtered duration for loading collection is %f.úTotal ú& session files loaded accounting to # ú audio clips)rmrrnrr4r5r8rqrrr rrsrtrr)r rMrNr!rOr"r#r$r%rVrWrXr"r!rvÚzipped_itemsrzrbÚ rttm_filer{Útarget_spksÚ sess_spk_dictÚclus_spk_digitsÚrttm_spk_digitsr€r@r$rrrbs^) ÿ÷øÿ ÿþ zDiarizationLabel.__init__©NFF)rrrrr&r'rrr(r„Útuplerrrƒr…rr*rrr$rrZs@þôþýüûúùø ÷ öõôrc sxeZdZdZ ddeeeefdededef‡fdd „ Z ddedefd d„Z dededeeeffdd„Z‡Z S)ÚDiarizationSpeechLabelzP`DiarizationLabel` diarization data sample collector from structured json files.éFrÚemb_dictÚclus_label_dictÚround_digitsc# sd||_||_||_||_||_ggggggggf\} } }}} }}}tj||jdD]ò}|jrrtt t dd„||dDƒƒƒƒ}|drmt|j ¡ƒ}|j||dd}dd„| ¡Dƒ}d d„| ¡Dƒ}|jrl|}ned }d }n`g}t|ddƒ#}| ¡D]}|j|dd \}}}| d |||¡¡q€Wd ƒn1s¡wYt ƒ}|D] }| ¡d}| |¡q«tt |ƒƒ} dd„t| ƒDƒ}t| ¡ƒ}!|!}|!}t|ƒdkrÜdg}"n dd„t|dƒDƒ}"|"D]2}!| |d¡| |d¡| |d¡| |d¡| |!¡| |¡| |¡| |¡qèq)tƒj| | ||| |||g|¢Ri|¤Žd S)a Parse lists of audio files, durations, RTTM (Diarization annotation) files. Since the diarization model infers only two speakers, speaker pairs are generated from the total number of speakers in the session. Args: manifest_filepath (str): Path to input manifest JSON files. emb_dict (Dict): Dictionary containing cluster-average embeddings and speaker mapping information. clus_label_dict (Dict): Segment-level speaker labels from clustering results. round_digit (int): Number of digits to round. seq_eval_mode (bool): If True, F1 score will be calculated for each speaker pair during inference mode. pairwise_infer (bool): If True, this dataset class operates in inference mode. In inference mode, a set of speakers in the input audio is split into multiple pairs of speakers and speaker tuples (e.g., 3 speakers: [(0,1), (1,2), (0,2)]) and then fed into the diarization system to merge the individual results. *args: Args to pass to `SpeechLabel` constructor. **kwargs: Kwargs to pass to `SpeechLabel` constructor. r®cSsg|]}|d‘qS©r3r©r0ròrrrr¿þrÀz3DiarizationSpeechLabel.__init__..Úuniq_idr+rmcSs$i|]\}}t| d¡dƒ|“qS)r@éÿÿÿÿ©rƒr¨©r0ÚkÚvrrrÚ s$z3DiarizationSpeechLabel.__init__..cSs"g|] \}}t| d¡dƒ‘qS)r@rZr;r<rrrr¿s"Nr3r¥)Údecimalsz{} {} {}r:cSsi|]\}}||“qSrr)r0rhÚvalrrrr?sr3)rrZcSsg|]}|‘qSrrr8rrrr¿srzrbr{)r6r4r5Ú seq_eval_modeÚpairwise_inferr r=Ú(_DiarizationSpeechLabel__parse_item_rttmrór<rôrÂrÚitemsr>r?Úsplit_rttm_linerr÷r¨ÚaddÚ enumerater1rrrrr)#r rr4r5r6rBrCr°r±rMrNr!rOr"r#r$r%r1Úclus_speaker_digitsÚbase_scale_indexÚ_sess_spk_dictr-Úrttm_speaker_digitsÚrttm_labelsrBrØÚstartÚendr|Úspeaker_setÚ rttm_lineÚspk_strÚspeaker_listr,Ú spk_comb_listr$rrrÈsŒ#ø"€þÿ ø ø ÷ özDiarizationSpeechLabel.__init__r¥rQr@cCsT| ¡ ¡}tt|dƒ|ƒ}tt|dƒ|ƒtt|dƒ|ƒ}|d}|||fS)a‹ Convert a line in RTTM file to speaker label, start and end timestamps. An example line of `rttm_line`: SPEAKER abc_dev_0123 1 146.903 1.860 speaker543 The above example RTTM line contains the following information: session name: abc_dev_0123 segment start time: 146.903 segment duration: 1.860 speaker label: speaker543 Args: rttm_line (str): A line in RTTM formatted file containing offset and duration of each segment. decimals (int): Number of digits to be rounded. Returns: start (float): Start timestamp in floating point number. end (float): End timestamp in floating point number. speaker (str): speaker string in RTTM lines. r¥éé)rÔr¨Úroundr„)r rQr@ÚrttmrNrOr|rrrrF7s $ z&DiarizationSpeechLabel.split_rttm_linerØrÙr.c CsÄt |¡}d|vr| d¡|d<nd|vr| d¡|d<ntd|›dƒ‚tj |d¡|d<tj tj |d¡¡d|d<d|vrKtd|›d ƒ‚t |d|d|d|d | dd¡d }|S)ú2Parse each rttm file and save it to in Dict formatrÚrzrÛrrrr9rbrÚ rttm_filepathr{N©rzr9rbr+r{)rârãrär7r4r5rr8rqrçrèrrrrÚ__parse_item_rttmXs( ÿ ûz(DiarizationSpeechLabel.__parse_item_rttm)r3FF)r¥)rrrrr r(rrrƒrrFrrDr*rrr$rr2Ås ùþýüûo&!r2cspeZdZdZejdddZ ddeedeed ee d eedee de ed edef‡fdd„ Z ‡ZS)ÚEndtoEndDiarizationLabelzMList of end-to-end diarization audio-label correspondence with preprocessing.r z,audio_file uniq_id duration rttm_file offsetrINFrMÚuniq_idsrNr!rOrVrWrXc s"|ri|_|j} gd} }t|||||ƒ}|D]I\} }}}}|dur$d}| | | ||||ƒ¡|rXt| tƒrCt| ƒdkrCtd| ›ƒ‚tj tj | ¡¡\}}t| ƒd|j|<t| ƒ|kr`nq|rs|rkt d¡n| jdd„d t d |¡t dt| ƒ›dt|ƒ›d ¡tƒ | ¡dS)a÷ Instantiates audio-label manifest with filters and preprocessing. This method initializes the EndtoEndDiarizationLabel object by processing the input data and applying optional filters and sorting. Args: audio_files (List[str]): List of audio file paths. uniq_ids (List[str]): List of unique identifiers for each audio file. durations (List[float]): List of float durations for each audio file. rttm_files (List[str]): List of RTTM path strings (Groundtruth diarization annotation file). offsets (List[float]): List of offsets or None for each audio file. max_number (Optional[int]): Maximum number of samples to collect. Defaults to None. do_sort_by_duration (bool): If True, sort samples list by duration. Defaults to False. index_by_file_id (bool): If True, saves a mapping from filename base (ID) to index in data. Defaults to False. rYNrzEmpty audio file list: rZr^cSr_r`rarcrrrreÀrfz3EndtoEndDiarizationLabel.__init__..rgr&r'r(r))rmrrnrrpr<rrr7r4r5r8rqr rrsrtrr)r rMr^rNr!rOrVrWrXr"r!rvr*rzr9rbr+r{r€r@r$rrrysT úûÿ ÿþ z!EndtoEndDiarizationLabel.__init__r0©rrrrr&r'rrr(r„rrƒr…rr*rrr$rr]qs4þ÷þýüûúùø ÷r]csXeZdZdZ ddeeeefdef‡fdd„ Zdeded e ee ffd d„Z‡ZS) ÚEndtoEndDiarizationSpeechLabelzPEnd-to-end speaker diarization data sample collector from structured json files.r3rr6cs ||_gggggf\}}}}} tj||jdD]%} | | d¡| | d¡| | d¡| | d¡| | d¡qtƒj||||| g|¢Ri|¤ŽdS)a$ Parse lists of audio files, durations, RTTM (Diarization annotation) files. Since diarization model infers only two speakers, speaker pairs are generated from the total number of speakers in the session. Args: manifest_filepath (str): Path to input manifest json files. round_digit (int): Number of digits to be rounded. *args: Args to pass to `SpeechLabel` constructor. **kwargs: Kwargs to pass to `SpeechLabel` constructor. r®rzr9rbr+r{N)r6r r=Ú0_EndtoEndDiarizationSpeechLabel__parse_item_rttmrrr)r rr6r°r±rMr^rNr!rOr1r$rrrÎs0ûûú ùz'EndtoEndDiarizationSpeechLabel.__init__rØrÙr.c Cst |¡}d|vs|ddurd|d<d|vr d|vr$| d¡|d<nd|vr0| d¡|d<ntd|›dƒ‚t|dtƒrT|dD]}t t||d ¡qCt|d<n0t|dt ƒrwt|d|d |d<t j |d¡svt d |d›ƒ‚n td|›d|d›dƒ‚d |vr‰nd|vr•| d¡|d <nd|vr¡| d¡|d <nd|d <|d durÆt|d |d |d <t j |d ¡sÆt d|d ›ƒ‚d|vrÚt j t j |d¡¡d|d<t|dt ƒsétd|›dƒ‚d|vrõtd|›dƒ‚t|d|d|d|d | dd¡d}|S)rYr{NrrzrÚrÛrrrÜzAudio file not found: z" without proper audio file value: r§r+Ú rttm_filenamerZzRTTM file not found: r9z without proper uniq_id key.rbrr[)rârãrär7rpr<Úaudio_file_listrrr(r4r5r6ÚFileNotFoundErrorr8rqrçrè)r rØrÙr1Úsingle_audio_filerrrr\ýsj ÿ ÿÿÿÿ ûz0EndtoEndDiarizationSpeechLabel.__parse_item_rttmr7) rrrrr r(rrƒrrrrar*rrr$rr`Ësýþý&/r`cs~eZdZdZejdddZ ddeee e fdee dee d ee d ee dee deed e f‡fdd„ Z‡ZS)ÚAudioz8Prepare a list of all audio items, filtered by duration.z audio_files duration offset textrINFÚaudio_files_listÚ duration_listÚoffset_listÚ text_listrTrUrVrWc sð|j} gd} }d\}} t||||ƒD]>\}}}}|dur*||kr*| |7} |d7}q|dur;||kr;| |7} |d7}q||7}| | ||||ƒ¡t| ƒ|krQnq|r\| jdd„dt dt| ƒ|d ¡t d || d ¡tƒ | ¡dS)aLInstantiantes an list of audio files. Args: audio_files_list: list of dictionaries with mapping from audio_key to audio_filepath duration_list: list of durations of input files offset_list: list of offsets text_list: list of texts min_duration: Minimum duration to keep entry with (default: None). max_duration: Maximum duration to keep entry with (default: None). max_number: Maximum number of samples to collect. do_sort_by_duration: True if sort samples list by duration. rY)rrYNrZcSr_r`rarcrrrrexrfz Audio.__init__..rgrirjrk) rrnrrrrsr rtrr)r rgrhrirjrTrUrVrWr"r!rxrwrvrMrbr{r#r$rrrHs, ÿzAudio.__init__)NNNF)rrrrr&r'rrrr(r„rrƒr…rr*rrr$rrfCs0÷þýüûúùø ÷rfcs\eZdZdZdeeeefdeeeff‡fdd„Zdededeee ffd d „Z ‡ZS)ÚAudioCollectionz)List of audio files from a manifest file.Úmanifest_filesÚaudio_to_manifest_keycsÖt|ƒtkr| d¡}| ¡D]\}}t|ƒtkr$d|vr$| d¡||<q||_ggggf\}}} } tj||jdD]}| |d¡| |d¡| |d¡| |d¡q:t ƒj ||| | g|¢Ri|¤ŽdS)aInstantiates a list of audio files loaded from a manifest file. Args: manifest_files: path to a single manifest file or a list of paths audio_to_manifest_key: dictionary mapping audio signals to keys of the manifest rÓr®rMrbr{r#N)Útyper(r¨rErmr r=Ú_AudioCollection__parse_itemrrr)r rlrmr°r±Ú audio_keyÚmanifest_keyrgrhrirjr1r$rrrƒs €$zAudioCollection.__init__rØrÙr.c sdtdttttffdd„}t |¡}i}|j ¡D]A\}}|||ƒ}t|tƒr2t |ˆ¡||<qt|tƒrC‡fdd„|Dƒ||<q|durQ| d¡rQd||<qt d t|ƒ›d |›ƒ‚||d<d|vrot d |›dˆ›ƒ‚d|vrwd|d<d|vrd|d<t|d|d|d|ddS)a;Parse a single line from a manifest file. Args: line: a string representing a line from a manifest file in JSON format manifest_file: path to the manifest file. Used to resolve relative paths. Returns: Dictionary with audio_files, duration, and offset. r1rqcSs¨|durd}|St|tƒr||}|St|tƒrHg}|D])}||}t|tƒr-| |¡qt|tƒr7||7}qtdt|ƒ›d|›d|›ƒ‚|Stdt|ƒ›d|›ƒ‚)z{Get item[key] if key is string, or a list of strings by combining item[key[0]], item[key[1]], etc. NúUnexpected type z of item for key z: z of manifest_key: )rpr(rrr<r7rn)r1rqrzrhÚitem_keyrrrÚget_audio_file´s" ï ò þz4AudioCollection.__parse_item..get_audio_filecsg|]}t |ˆ¡‘qSr)r r)r0rB©rÙrrr¿ßsz0AudioCollection.__parse_item..Nrrrz of audio_file: rMrbz Duration not available in line: z. Manifest file: r{rYr#)rMrbr{r#)rr r(rrârãrmrErpr rrÚ startswithr7rnrç) r rØrÙrtr1rMrprqrzrrurré¨s, ÿzAudioCollection.__parse_item)rrrrr r(rrrrror*rrr$rrk€sþ ý&%rkcsteZdZdZejdddZ ddeedeed ee d e e de e de ed edef‡fdd„ Z ‡ZS)ÚFeatureLabelzKList of feature sequence and their label correspondence with preprocessing.ÚFeatureLabelEntityzfeature_file label durationrINFrrìrNrTrUrVrWrXc sd|j} g} d}d}tƒ|_|ri|_t|||ƒD]T\} }}|dur*||kr*||7}q|dur7||kr7||7}q| | | ||ƒ¡|jt|ƒO_||7}|rdtj tj | ¡¡\}}t | ƒd|j|<t | ƒ|krlnq|r|rwt d¡n| j dd„dt d|d d ›d¡t dt | ƒ›d |dd›d¡t d t | ƒt |jƒ¡¡tƒ | ¡dS)aXInstantiates feature-SequenceLabel manifest with filters and preprocessing. Args: feature_files: List of feature files. labels: List of labels. max_number: Maximum number of samples to collect. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. rYNrZr^cSr_r`rarcrrrre7rfz'FeatureLabel.__init__..rgrîi( z.2frðríz items, total duration of rjrïz.# {} files loaded including # {} unique labels)rrôrörmrnrr4r5r8rqrrr rrsrtr÷rr)r rrìrNrTrUrVrWrXr"r!rvrxrrñrbr€r@r$rrrs>ÿ"zFeatureLabel.__init__rr_rrr$rrwøs8þ ÷þýüûúùø ÷rwc sheZdZdZ ddeeeefdededeef‡fdd „ Z d edede eeffd d„Z‡Z S)ÚASRFeatureLabelz8`FeatureLabel` collector from asr structured json files.FNrrþrÿrcsÂggg}}} g} tj||jdD]5}| |d¡| |d¡|s4|d}|s.| ¡n| |¡} n t|dƒ}|g} | |¡| | ¡q|rPt | ¡|_ t ƒj||| g|¢Ri|¤ŽdS)aÛParse lists of feature files and sequences of labels. Args: manifests_files: Either single string file or list of such - manifests to yield items from. max_number: Maximum number of samples to collect; pass to `FeatureSequenceLabel` constructor. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data; pass to `FeatureSequenceLabel` constructor. r®rrbrñN)r r=rrr¨r„rÇr&rürýrr)r rrþrÿrr°r±rrìrNrr1rñrr$rrrBs "zASRFeatureLabel.__init__rØrÙr.cCs®t |¡}d|vr| d¡|d<nd|vr| d¡|d<nd|vr)td|›dƒ‚tj|d|d|d<d|vr@| d¡|d<ntd|›dƒ‚t|d|d|d d }|S)Nrrrrz# without proper 'feature_file' key.rÜrñz without proper 'label' key.rb)rrñrb)rârãrär7r rrçrrrrris ÿzASRFeatureLabel._parse_itemr)rrrrr r(rr…rrrrrr*rrr$rry?sûþýüû&'ryc!sÂeZdZdZejdddZ ddeedee d ee d ee dee dee d eeedeeedeeedeee dej dee dee deededef ‡fdd„ Z‡ZS)ÚFeatureTextrGÚFeatureTextEntityzSid feature_file rttm_file duration text_tokens offset text_raw speaker orig_sr langrINFrLrr!rNrrOrPrQrRrSrrTrUrVrWrXc"sì|j}gdddf\}}}}|ri|_t||||||||| | ƒ D]¨\ }}}}}}}}} }|dur<||kr<||7}|d7}q| durM|| krM||7}|d7}q| durT| }n3|dkrxt|dƒrs|jrst|tƒrs|duro|||ƒ}ntdƒ‚||ƒ}ng}|dur‡||7}|d7}q||7}| |||||||||||ƒ ¡|r¿t j t j |¡¡\} }!| |jvr³g|j| <|j| t |ƒd¡t |ƒ|krÇnq|rÚ|rÒt d¡n|jd d „dt dt |ƒ|d ¡t d||d ¡tƒ |¡dS)aKInstantiates feature-text manifest with filters and preprocessing. Args: ids: List of examples positions. feature_files: List of audio feature files. rttm_files: List of audio rttm files. durations: List of float durations. texts: List of raw text transcripts. offsets: List of duration offsets or None. speakers: List of optional speakers ids. orig_sampling_rates: List of original sampling rates of audio files. langs: List of language ids, one for eadh sample, or None. parser: Instance of `CharParser` to convert string to tokens. min_duration: Minimum duration to keep entry with (default: None). max_duration: Maximum duration to keep entry with (default: None). max_number: Maximum number of samples to collect. do_sort_by_duration: True if sort samples list by duration. Not compatible with index_by_file_id. index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. rYrNrZr[r\r]r^cSr_r`rarcrrrreórfz&FeatureText.__init__..rgrirjrkrl)"r rLrr!rNrrOrPrQrRrSrrTrUrVrWrXr"r!rvrwrxryÚ feat_filer+rbr{r#r|r}r~rr€r@r$rrrŠsp'ö ÿ ÿzFeatureText.__init__rr‚rrr$rrz‚sXþïþýüûúù ø ÷ ö õô óòñðïrzcrµ)ÚASRFeatureTextz7`FeatureText` collector from asr structured json files.rcsüggggggf\}}}}}} ggggf\} }}} t |¡D]H}| |d¡| |d¡| |d¡| |d¡| |d¡| |d¡| |d¡| |d¡| |d ¡| |d ¡qtƒj|||||| | ||| g |¢Ri|¤ŽdS)rr‘rr+rbr#r{r|r}rRr~Nr¯)r rr°r±rLrr!rNrrOrPr²rRrSr1r$rrrþsTúù öõ ôzASRFeatureText.__init__r·rrr$rr}ûr¸r})4r&râr4Ú itertoolsrÚtypingrrrrrrr ÚnumpyrÄÚpandasr9Ú+nemo.collections.common.parts.preprocessingr rÚ4nemo.collections.common.parts.preprocessing.manifestrÚ nemo.utilsr rÚUserListrrr+rFr‡rŠr«Úobjectr³r¶r¹rÏrêrúr rrr2r]r`rfrkrwryrzr}rrrrÚsL$ sl+*vVOV:k-Zx=xGCy