o 7wÖi˜Pã@sUdZddlZddlZddlZddlmZddlmZmZm Z m Z ddlZddlm Z ddlmZe eƒZdZd Zd ZdZe d¡Zejed <Gdd„deƒZgddfdd„Zdedee eeeffdd„Zde eefdee eeefddfdd„ZdS)a+Lexicon class and utilities. Provides functions to read/write lexicon files and convert them to k2 ragged tensors. The Lexicon class provides a way to convert a list of words to a ragged tensor containing token IDs. It also stores the lexicon graph which can be used by a graph compiler to decode sequences. This code was adjusted, and therefore heavily inspired or taken from from icefall's (https://github.com/k2-fsa/icefall) Lexicon class and its utility functions. Authors: * Pierre Champion 2023 * Zeyu Zhao 2023 * Georgios Karakasidis 2023 éN)ÚPath)ÚListÚOptionalÚTupleÚUnion)Ú get_loggeré)Úk2zzzzz^#\d+$ÚDISAMBIG_PATTERNc @s"eZdZdZdefdd„Zedeefdd„ƒZ ede jfdd „ƒZd e jfdd„Z d e jde jfdd„Z d%deedeedeeefdd„Z d&deedeeeefdd„Z d&deedeeeeefdd„Z d'deededefdd„Zdd „Zd(d"efd#d$„ZdS))ÚLexiconaE Unit based lexicon. It is used to map a list of words to each word's sequence of tokens (characters). It also stores the lexicon graph which can be used by a graph compiler to decode sequences. Arguments --------- lang_dir: str Path to the lang directory. It is expected to contain the following files: - tokens.txt - words.txt - L.pt Example ------- >>> from speechbrain.k2_integration import k2 >>> from speechbrain.k2_integration.lexicon import Lexicon >>> from speechbrain.k2_integration.graph_compiler import CtcGraphCompiler >>> from speechbrain.k2_integration.prepare_lang import prepare_lang >>> # Create a small lexicon containing only two words and write it to a file. >>> lang_tmpdir = getfixture('tmpdir') >>> lexicon_sample = '''hello h e l l o\nworld w o r l d''' >>> lexicon_file = lang_tmpdir.join("lexicon.txt") >>> lexicon_file.write(lexicon_sample) >>> # Create a lang directory with the lexicon and L.pt, L_inv.pt, L_disambig.pt >>> prepare_lang(lang_tmpdir) >>> # Create a lexicon object >>> lexicon = Lexicon(lang_tmpdir) >>> # Make sure the lexicon was loaded correctly >>> assert isinstance(lexicon.token_table, k2.SymbolTable) >>> assert isinstance(lexicon.L, k2.Fsa) Úlang_dirc sŽt|ƒˆ_}tj |d¡ˆ_tj |d¡ˆ_iˆ_t|dddd:}|D]/}| ¡ ¡d}| ¡ ¡dd…}‡fd d „|Dƒ}|ˆjvrOgˆj|<ˆj| |¡q(Wdƒn1sbwYdˆ_|d ¡r…t d|›d ¡tj t |d¡¡}n t|›d|›ƒ‚|d ¡r©t d|›d¡tj t |d¡¡}nt d¡t | ¡¡}t | ¡|d¡|ˆ_|ˆ_dS)Nz tokens.txtz words.txtúlexicon.txtÚrúutf-8©Úencodingrrcsg|]}ˆj|‘qS©)Útoken_table)Ú.0Út©Úselfrú_/home/ubuntu/sommelier/.venv/lib/python3.10/site-packages/speechbrain/k2_integration/lexicon.pyÚ Zsz$Lexicon.__init__..zL.ptúLoading compiled z/L.ptzM/L.pt does not exist. Please make sure you have successfully created L.pt in zLinv.ptz/Linv.ptzConverting L.pt to Linv.pt)rrr ÚSymbolTableÚ from_filerÚ word_tableÚ word2tokenidsÚopenÚstripÚsplitÚappendÚ_L_disambigÚexistsÚloggerÚinfoÚFsaÚ from_dictÚtorchÚloadÚRuntimeErrorÚarc_sortÚinvertÚsaveÚas_dictÚL_invÚL) rrÚfÚlineÚwordÚtokensÚtidsr1r0rrrÚ__init__Ns> ùÿ ÿÿ zLexicon.__init__ÚreturncCsD|jj}g}|D]}t |¡r|tkr| |j|¡q| ¡|S)zm Return a list of token IDs excluding those from disambiguation symbols and epsilon. )rÚsymbolsr ÚmatchÚEPSr"Úsort)rr9ÚansÚsrrrr5xs€zLexicon.tokenscCsh|jdur1t d|j›d¡|jd ¡r&tj t |jd¡¡|_|jSt |j›d|j›ƒ‚|jS)zl Return the lexicon FSA (with disambiguation symbols). Needed for HLG construction. Nrz/L_disambig.ptz L_disambig.ptz_/L_disambig.pt does not exist. Please make sure you have successfully created L_disambig.pt in )r#r%r&rr$r r'r(r)r*r+rrrrÚ L_disambig†s ÿüÿÿzLexicon.L_disambigÚGcCsd|j|j|jdk<dS)zž Remove the disambiguation symbols of a G graph Arguments --------- G: k2.Fsa The G graph to be modified rú#0N)Úlabelsr)rr@rrrÚ#remove_G_rescoring_disambig_symbols™s z+Lexicon.remove_G_rescoring_disambig_symbolsÚLGcCsd|jd}|jd}t d¡|j ¡}d|||k<||_t|jtj ƒs&J‚d|jj |jj |k<|S)a Remove the disambiguation symbols of an LG graph Needed for HLG construction. Arguments --------- LG: k2.Fsa The LG graph to be modified Returns ------- LG: k2.Fsa The modified LG graph rAz%Removing disambiguation symbols on LGr)rrr%ÚdebugrBÚcloneÚ isinstanceÚ aux_labelsr ÚRaggedTensorÚvalues)rrDÚfirst_token_disambig_idÚfirst_word_disambig_idrBrrrÚremove_LG_disambig_symbols¤s z"Lexicon.remove_LG_disambig_symbolsFNTÚtextsÚsil_token_idcs\|j||dd}|r,ˆdusJdƒ‚tt|ƒƒD]}‡fdd„||Dƒdd…||<q|S)a. Convert a list of texts into word IDs. This method performs the mapping of each word in the input texts to its corresponding ID. The result is a list of lists, where each inner list contains the word IDs for a sentence. If the `add_sil_token_as_separator` flag is True, a silence token is inserted between words, and the `sil_token_id` parameter specifies the ID for the silence token. If a word is not found in the vocabulary, a warning is logged if `log_unknown_warning` is True. Arguments --------- texts: List[str] A list of strings where each string represents a sentence. Each sentence is composed of space-separated words. add_sil_token_as_separator: bool Flag indicating whether to add a silence token as a separator between words. sil_token_id: Optional[int] The ID of the silence token. If not provided, the separator is not added. log_unknown_warning: bool Flag indicating whether to log a warning for unknown words. Returns ------- word_ids: List[List[int]] A list of lists where each inner list represents the word IDs for a sentence. The word IDs are obtained based on the vocabulary mapping. r©Ú_mapperNz7sil_token_id=None while add_sil_token_as_separator=Truecsg|]}|ˆfD]}|‘qqSrr)rÚitemÚx©rOrrrðs ÿÿz-Lexicon.texts_to_word_ids..éÿÿÿÿ)Ú _texts_to_idsÚrangeÚlen)rrNÚadd_sil_token_as_separatorrOÚlog_unknown_warningÚword_idsÚirrTrÚtexts_to_word_idsÃs%ÿ ÿ ÿ þzLexicon.texts_to_word_idscCs|j||ddS)a Convert a list of text sentences into token IDs. Parameters ---------- texts: List[str] A list of strings, where each string represents a sentence. Each sentence consists of space-separated words. Example: ['hello world', 'tokenization with lexicon'] log_unknown_warning: bool Flag indicating whether to log warnings for out-of-vocabulary tokens. If True, warnings will be logged when encountering unknown tokens. Returns ------- token_ids: List[List[List[int]]] A list containing token IDs for each sentence in the input. The structure of the list is as follows: [ [ # For the first sentence [token_id_1, token_id_2, ..., token_id_n], [token_id_1, token_id_2, ..., token_id_m], ... ], [ # For the second sentence [token_id_1, token_id_2, ..., token_id_p], [token_id_1, token_id_2, ..., token_id_q], ... ], ... ] Each innermost list represents the token IDs for a word in the sentence. rrP©rV©rrNrZrrrÚtexts_to_token_idsõs(ÿzLexicon.texts_to_token_idscCs|j||dddS)aÝ Convert a list of input texts to token IDs with multiple pronunciation variants. This method converts input texts into token IDs, considering multiple pronunciation variants. The resulting structure allows for handling various pronunciations of words within the given texts. Arguments --------- texts: List[str] A list of strings, where each string represents a sentence for an utterance. Each sentence consists of space-separated words. log_unknown_warning: bool Indicates whether to log warnings for out-of-vocabulary (OOV) tokens. If set to True, warnings will be logged for OOV tokens during the conversion. Returns ------- token_ids: List[List[List[List[int]]]] A nested list structure containing token IDs for each utterance. The structure is as follows: - Outer List: Represents different utterances. - Middle List: Represents different pronunciation variants for each utterance. - Inner List: Represents the sequence of token IDs for each pronunciation variant. - Innermost List: Represents the token IDs for each word in the sequence. rT)rQÚ_multiple_pronunciationr^r_rrrÚ.texts_to_token_ids_with_multiple_pronunciation!süz6Lexicon.texts_to_token_ids_with_multiple_pronunciationrZrQc Csº|jt}|dkr|jtg}t||ƒ}g}|D]B}g} | ¡} t| ƒD]0\}}||vrA||} t| tƒr;|s;| d} | | ¡q$| |¡|rTt d|›d|›d¡q$| | ¡q|S)a® Convert a list of texts to a list of IDs, which can be either word IDs or a list of token IDs. Arguments --------- texts: List[str] A list of strings where each string consists of space-separated words. Example: ['hello world', 'tokenization with lexicon'] log_unknown_warning: bool Log a warning if a word is not found in the token-to-IDs mapping. _mapper: str The mapper to use, either "word_table" (e.g., "TEST" -> 176838) or "word2tokenids" (e.g., "TEST" -> [23, 8, 22, 23]). _multiple_pronunciation: bool Allow returning all pronunciations of a word from the lexicon. If False, only return the first pronunciation. Returns ------- ids_list: List[List[int] or int] Returns a list-of-list of word IDs or a list of token IDs. rrzCannot find word z in the mapper zG. Replacing it with OOV token. Note that it is fine if you are testing.)rÚUNKrÚUNK_tÚgetattrr!Ú enumeraterGÚlistr"r%Úwarning)rrNrZrQraÚoov_token_idÚidsÚids_listÚtextr[Úwordsr\r4ÚidwordrrrrVFs0 " ÿ ÿ€zLexicon._texts_to_idscCs<t |j¡|_t |j¡|_|jdurt |j¡|_dSdS)z@ Sort L, L_inv, L_disambig arcs of every state. N)r r,r1r0r#rrrrr,…s ÿzLexicon.arc_sortÚcpuÚdevicecCs<|j |¡|_|j |¡|_|jdur|j |¡|_dSdS)z‹ Device to move L, L_inv and L_disambig to Arguments --------- device: str The device N)r1Útor0r#)rrprrrrqŽs ÿz Lexicon.to)FNT)T)F)ro)Ú__name__Ú __module__Ú__qualname__Ú__doc__rr7ÚpropertyrÚintr5r r'r?rCrMÚstrrr]r`rbÚboolrVr,rqrrrrr*sV# þ* "ûþü ú5ýþ ü/ýþ ü*ûþý ü? rÚwrdTc Csªtƒ}t|ƒdkrP|D]D}t|ddd3}t |¡}|D]#} | | ¡} | D]}||vr>|r8t|ƒtg||<q&t|ƒ||<q&qWdƒn1sJwYq|D];}t|dd+}|D] }| ¡ ¡d}||vr}|rwt|ƒtg||<q]t|ƒ||<q]Wdƒn1sˆwYqRt j |ddtt j |d¡d dd*}t ›d t›d} |D]}| |d d ||¡d7} q¬| | ¡WdƒdS1sÎwYdS)aÖ Read extra_csv_files to generate a $lang_dir/lexicon.txt for k2 training. This usually includes the csv files of the training set and the dev set in the output_folder. During training, we need to make sure that the lexicon.txt contains all (or the majority of) the words in the training set and the dev set. NOTE: This assumes that the csv files contain the transcription in the last column. Also note that in each csv_file, the first line is the header, and the remaining lines are in the following format: ID, duration, wav, spk_id, wrd (transcription) We only need the transcription in this function. Writes out $lang_dir/lexicon.txt Note that the lexicon.txt is a text file with the following format: word1 phone1 phone2 phone3 ... word2 phone1 phone2 phone3 ... In this code, we simply use the characters in the word as the phones. You can use other phone sets, e.g., phonemes, BPEs, to train a better model. Arguments --------- lang_dir: str The directory to store the lexicon.txt vocab_files: List[str] A list of extra vocab files. For example, for librispeech this could be the librispeech-vocab.txt file. extra_csv_files: List[str] A list of csv file paths column_text_key: str The column name of the transcription in the csv file. By default, it is "wrd". add_word_boundary: bool whether to add word boundary symbols at the end of each line to the lexicon for every word. Example ------- >>> from speechbrain.k2_integration.lexicon import prepare_char_lexicon >>> # Create some dummy csv files containing only the words `hello`, `world`. >>> # The first line is the header, and the remaining lines are in the following >>> # format: >>> # ID, duration, wav, spk_id, wrd (transcription) >>> csv_file = getfixture('tmpdir').join("train.csv") >>> # Data to be written to the CSV file. >>> import csv >>> data = [ ... ["ID", "duration", "wav", "spk_id", "wrd"], ... [1, 1, 1, 1, "hello world"], ... [2, 0.5, 1, 1, "hello"] ... ] >>> with open(csv_file, "w", newline="", encoding="utf-8") as f: ... writer = csv.writer(f) ... writer.writerows(data) >>> extra_csv_files = [csv_file] >>> lang_dir = getfixture('tmpdir') >>> vocab_files = [] >>> prepare_char_lexicon(lang_dir, vocab_files, extra_csv_files=extra_csv_files, add_word_boundary=False) rrrrNT)Úexist_okr Úwú Ú )ÚdictrXrÚcsvÚ DictReaderr!rgÚEOWr ÚosÚmakedirsÚpathÚjoinrcrdÚwrite)rÚvocab_filesÚextra_csv_filesÚcolumn_text_keyÚadd_word_boundaryÚlexiconÚfiler2Ú csv_readerÚrowrmr4r3ÚfcrrrÚprepare_char_lexiconsNF €ûýþ€€øÿ€ÿ "úr‘Úfilenamer8c CsÐg}t|dddU}t d¡}|D]D}| | d¡¡}t|ƒdkr"qt|ƒdkr3td|›d |›d ƒ‚|d}|tkrHtd|›d |›t›dƒ‚|dd …}| ||f¡qWd ƒ|S1sawY|S)a” Read a lexicon from `filename`. Each line in the lexicon contains "word p1 p2 p3 ...". That is, the first field is a word and the remaining fields are tokens. Fields are separated by space(s). Arguments --------- filename: str Path to the lexicon.txt Returns ------- ans: A list of tuples., e.g., [('w', ['p1', 'p2']), ('w1', ['p3, 'p4'])] rrrz[ ]+z rézFound bad line z in lexicon file z3Every line is expected to contain at least 2 fieldsz should not be a valid wordrN) rÚreÚcompiler!r rXr+r;r")r’r=r2Ú whitespacer3Úar4r5rrrÚread_lexicons2 ÿÿÿð þír˜rŒc Cs^t|ddd}|D]\}}| |›dd |¡›d¡q WdƒdS1s(wYdS)zê Write a lexicon to a file. Arguments --------- filename: str Path to the lexicon file to be generated. lexicon: List[Tuple[str, List[str]]] It can be the return value of :func:`read_lexicon`. r|rrr}r~N)rr‡r†)r’rŒr2r4r5rrrÚ write_lexicon2s ÿ"ÿr™) rur€rƒr”ÚpathlibrÚtypingrrrrr)Úspeechbrain.utils.loggerrÚr rrr%rcrdr‚r;r•r ÚPatternÚ__annotations__Úobjectrr‘rxr˜r™rrrrÚs@ÿx û"k* ÿÿþ