"""
Data pipeline elements for the G2P pipeline

Authors
 * Loren Lugosch 2020
 * Mirco Ravanelli 2020
 * Artem Ploujnikov 2021 (minor refactoring only)
"""

import re
from functools import reduce

import torch
from torch import nn

import speechbrain as sb
from speechbrain.wordemb.util import expand_to_chars

RE_MULTI_SPACE = re.compile(r"\s{2,}")


def clean_pipeline(txt, graphemes):
    """
    Cleans incoming text, removing any characters not on the
    accepted list of graphemes and converting to uppercase

    Arguments
    ---------
    txt: str
        the text to clean up
    graphemes: list
        a list of graphemes

    Returns
    -------
    item: DynamicItem
        A wrapped transformation function
    """
    result = txt.upper()
    result = "".join(char for char in result if char in graphemes)
    result = RE_MULTI_SPACE.sub(" ", result)
    return result


def grapheme_pipeline(char, grapheme_encoder=None, uppercase=True):
    """Encodes a grapheme sequence

    Arguments
    ---------
    char: str
        A list of characters to encode.
    grapheme_encoder: speechbrain.dataio.encoder.TextEncoder
        a text encoder for graphemes. If not provided,
    uppercase: bool
        whether or not to convert items to uppercase

    Yields
    ------
    grapheme_list: list
        a raw list of graphemes, excluding any non-matching
        labels
    grapheme_encoded_list: list
        a list of graphemes encoded as integers
    grapheme_encoded: torch.Tensor
    """
    if uppercase:
        char = char.upper()
    grapheme_list = [
        grapheme for grapheme in char if grapheme in grapheme_encoder.lab2ind
    ]
    yield grapheme_list
    grapheme_encoded_list = grapheme_encoder.encode_sequence(grapheme_list)
    yield grapheme_encoded_list
    grapheme_encoded = torch.LongTensor(grapheme_encoded_list)
    yield grapheme_encoded


def tokenizer_encode_pipeline(
    seq,
    tokenizer,
    tokens,
    wordwise=True,
    word_separator=" ",
    token_space_index=512,
    char_map=None,
):
    """A pipeline element that uses a pretrained tokenizer

    Arguments
    ---------
    seq: list
        List of tokens to encode.
    tokenizer: speechbrain.tokenizer.SentencePiece
        a tokenizer instance
    tokens: str
        available tokens
    wordwise: bool
        whether tokenization is performed on the whole sequence
        or one word at a time. Tokenization can produce token
        sequences in which a token may span multiple words
    word_separator: str
        The substring to use as a separator between words.
    token_space_index: int
        the index of the space token
    char_map: dict
        a mapping from characters to tokens. This is used when
        tokenizing sequences of phonemes rather than sequences
        of characters. A sequence of phonemes is typically a list
        of one or two-character tokens (e.g. ["DH", "UH", " ", "S", "AW", "N", "D"]).
        The character map makes it possible to map these
        to arbitrarily selected characters

    Yields
    ------
    token_list: list
        a list of raw tokens
    encoded_list: list
        a list of tokens, encoded as a list of integers
    encoded: torch.Tensor
        a list of tokens, encoded as a tensor
    """
    token_list = [token for token in seq if token in tokens]
    yield token_list

    tokenizer_input = "".join(
        _map_tokens_item(token_list, char_map)
        if char_map is not None
        else token_list
    )

    if wordwise:
        encoded_list = _wordwise_tokenize(
            tokenizer(), tokenizer_input, word_separator, token_space_index
        )
    else:
        encoded_list = tokenizer().sp.encode_as_ids(tokenizer_input)
    yield encoded_list
    encoded = torch.LongTensor(encoded_list)
    yield encoded


def _wordwise_tokenize(tokenizer, sequence, input_separator, token_separator):
    """Tokenizes a sequence wordwise

    Arguments
    ---------
    tokenizer: speechbrain.tokenizers.SentencePiece.SentencePiece
        a tokenizer instance
    sequence: iterable
        the original sequence
    input_separator: str
        the separator used in the input sequence
    token_separator: str
        the token separator used in the output sequence

    Returns
    -------
    result: str
        the resulting tensor
    """
    if input_separator not in sequence:
        return tokenizer.sp.encode_as_ids(sequence)
    words = list(_split_list(sequence, input_separator))
    encoded_words = [
        tokenizer.sp.encode_as_ids(word_tokens) for word_tokens in words
    ]
    sep_list = [token_separator]
    # Join the per-word token lists, placing the separator token between words
    return reduce((lambda left, right: left + sep_list + right), encoded_words)


def _wordwise_detokenize(
    tokenizer, sequence, output_separator, token_separator
):
    """Detokenizes a sequence wordwise

    Arguments
    ---------
    tokenizer: speechbrain.tokenizers.SentencePiece.SentencePiece
        a tokenizer instance
    sequence: iterable
        the original sequence
    output_separator: str
        the separator used in the output sequence
    token_separator: str
        the token separator used in the output sequence

    Returns
    -------
    result: torch.Tensor
        the result
    """
    if isinstance(sequence, str) and sequence == "":
        return ""
    if token_separator not in sequence:
        sequence_list = (
            sequence if isinstance(sequence, list) else sequence.tolist()
        )
        return tokenizer.sp.decode_ids(sequence_list)
    words = list(_split_list(sequence, token_separator))
    encoded_words = [
        tokenizer.sp.decode_ids(word_tokens) for word_tokens in words
    ]
    return output_separator.join(encoded_words)


def _split_list(items, separator):
    """
    Splits a sequence (such as a tensor) by the specified separator

    Arguments
    ---------
    items: sequence
        any sequence that supports indexing
    separator: str
        the separator token

    Yields
    ------
    item
    """
    if items is not None:
        last_idx = -1
        for idx, item in enumerate(items):
            if item == separator:
                yield items[last_idx + 1 : idx]
                last_idx = idx
        if last_idx < idx - 1:
            yield items[last_idx + 1 :]


def enable_eos_bos(tokens, encoder, bos_index, eos_index):
    """
    Initializes the phoneme encoder with EOS/BOS sequences

    Arguments
    ---------
    tokens: list
        a list of tokens
    encoder: speechbrain.dataio.encoder.TextEncoder.
        a text encoder instance. If none is provided, a new one
        will be instantiated
    bos_index: int
        the position corresponding to the Beginning-of-Sentence
        token
    eos_index: int
        the position corresponding to the End-of-Sentence

    Returns
    -------
    encoder: speechbrain.dataio.encoder.TextEncoder
        an encoder
    """
    if encoder is None:
        encoder = sb.dataio.encoder.TextEncoder()
    if bos_index == eos_index:
        if "<eos-bos>" not in encoder.lab2ind:
            encoder.insert_bos_eos(
                bos_label="<eos-bos>",
                eos_label="<eos-bos>",
                bos_index=bos_index,
            )
    else:
        if "<bos>" not in encoder.lab2ind:
            encoder.insert_bos_eos(
                bos_label="<bos>",
                eos_label="<eos>",
                bos_index=bos_index,
                eos_index=eos_index,
            )
    if "<unk>" not in encoder.lab2ind:
        encoder.add_unk()
    encoder.update_from_iterable(tokens, sequence_input=False)
    return encoder


def phoneme_pipeline(phn, phoneme_encoder=None):
    """Encodes a sequence of phonemes using the encoder
    provided

    Arguments
    ---------
    phn: list
        List of phonemes
    phoneme_encoder: speechbrain.dataio.encoder.TextEncoder
        a text encoder instance (optional, if not provided, a new one
        will be created)

    Yields
    ------
    phn: list
        the original list of phonemes
    phn_encoded_list: list
        encoded phonemes, as a list
    phn_encoded: torch.Tensor
        encoded phonemes, as a tensor
    """
    yield phn
    phn_encoded_list = phoneme_encoder.encode_sequence(phn)
    yield phn_encoded_list
    phn_encoded = torch.LongTensor(phn_encoded_list)
    yield phn_encoded


def add_bos_eos(seq=None, encoder=None):
    """Adds BOS and EOS tokens to the sequence provided

    Arguments
    ---------
    seq: torch.Tensor
        the source sequence
    encoder: speechbrain.dataio.encoder.TextEncoder
        an encoder instance

    Yields
    ------
    seq_eos: torch.Tensor
        the sequence, with the EOS token added
    seq_bos: torch.Tensor
        the sequence, with the BOS token added
    """
    seq_bos = encoder.prepend_bos_index(seq)
    if not torch.is_tensor(seq_bos):
        seq_bos = torch.tensor(seq_bos)
    yield seq_bos.long()
    yield torch.tensor(len(seq_bos))
    seq_eos = encoder.append_eos_index(seq)
    if not torch.is_tensor(seq_eos):
        seq_eos = torch.tensor(seq_eos)
    yield seq_eos.long()
    yield torch.tensor(len(seq_eos))


def beam_search_pipeline(char_lens, encoder_out, beam_searcher):
    """Performs a Beam Search on the phonemes. This function is
    meant to be used as a component in a decoding pipeline

    Arguments
    ---------
    char_lens: torch.Tensor
        the length of character inputs
    encoder_out: torch.Tensor
        Raw encoder outputs
    beam_searcher: speechbrain.decoders.seq2seq.S2SBeamSearcher
        a SpeechBrain beam searcher instance

    Returns
    -------
    hyps: list
        hypotheses
    scores: list
        confidence scores associated with each hypotheses
    """
    return beam_searcher(encoder_out, char_lens)


def phoneme_decoder_pipeline(hyps, phoneme_encoder):
    """Decodes a sequence of phonemes

    Arguments
    ---------
    hyps: list
        hypotheses, the output of a beam search
    phoneme_encoder: speechbrain.dataio.encoder.TextEncoder
        a text encoder instance

    Returns
    -------
    phonemes: list
        the phoneme sequence
    """
    return phoneme_encoder.decode_ndim(hyps)


def char_range(start_char, end_char):
    """Produces a list of consecutive characters

    Arguments
    ---------
    start_char: str
        the starting character
    end_char: str
        the ending characters

    Returns
    -------
    char_range: str
        the character range
    """
    return [chr(idx) for idx in range(ord(start_char), ord(end_char) + 1)]


def build_token_char_map(tokens):
    """Builds a map that maps arbitrary tokens to arbitrarily chosen characters.
    This is required to overcome the limitations of SentencePiece.

    Arguments
    ---------
    tokens: list
        a list of tokens for which to produce the map

    Returns
    -------
    token_map: dict
        a dictionary with original tokens as keys and
        new mappings as values
    """
    chars = char_range("A", "Z") + char_range("a", "z")
    values = list(filter(lambda chr: chr != " ", tokens))
    token_map = dict(zip(values, chars[: len(values)]))
    token_map[" "] = " "
    return token_map


def flip_map(map_dict):
    """Exchanges keys and values in a dictionary

    Arguments
    ---------
    map_dict: dict
        a dictionary

    Returns
    -------
    reverse_map_dict: dict
        a dictionary with keys and values flipped
    """
    return {value: key for key, value in map_dict.items()}


def text_decode(seq, encoder):
    """Decodes a sequence using a tokenizer.
    This function is meant to be used in hparam files

    Arguments
    ---------
    seq: torch.Tensor
        token indexes
    encoder: sb.dataio.encoder.TextEncoder
        a text encoder instance

    Returns
    -------
    output_seq: list
        a list of lists of tokens
    """
    return encoder.decode_ndim(seq)


def char_map_detokenize(
    char_map, tokenizer, token_space_index=None, wordwise=True
):
    """Returns a function that recovers the original sequence from one that has been
    tokenized using a character map

    Arguments
    ---------
    char_map: dict
        a character-to-output-token-map
    tokenizer: speechbrain.tokenizers.SentencePiece.SentencePiece
        a tokenizer instance
    token_space_index: int
        the index of the "space" token
    wordwise: bool
        Whether to apply detokenize per word.

    Returns
    -------
    f: callable
        the tokenizer function
    """

    def detokenize_wordwise(item):
        """Detokenizes the sequence one word at a time"""
        return _wordwise_detokenize(tokenizer(), item, " ", token_space_index)

    def detokenize_regular(item):
        """Detokenizes the entire sequence"""
        return tokenizer().sp.decode_ids(item)

    detokenizer = detokenize_wordwise if wordwise else detokenize_regular

    def f(tokens):
        """The tokenizer function"""
        decoded_tokens = [detokenizer(item) for item in tokens]
        mapped_tokens = _map_tokens_batch(decoded_tokens, char_map)
        return mapped_tokens

    return f


def _map_tokens_batch(tokens, char_map):
    """Performs token mapping, in batch mode

    Arguments
    ---------
    tokens: iterable
        a list of token sequences
    char_map: dict
        a token-to-character mapping

    Returns
    -------
    result: list
        a list of lists of characters
    """
    return [[char_map[char] for char in item] for item in tokens]


def _map_tokens_item(tokens, char_map):
    """Maps tokens to characters, for a single item

    Arguments
    ---------
    tokens: iterable
        a single token sequence
    char_map: dict
        a token-to-character mapping

    Returns
    -------
    result: list
        a list of tokens
    """
    return [char_map[char] for char in tokens]


class LazyInit(nn.Module):
    """A lazy initialization wrapper

    Arguments
    ---------
    init : callable
        The function to initialize the underlying object
    """

    def __init__(self, init):
        super().__init__()
        self.instance = None
        self.init = init
        self.device = None

    def __call__(self):
        """Initializes the object instance, if necessary
        and returns it."""
        if self.instance is None:
            self.instance = self.init()
        return self.instance

    def to(self, device):
        """Moves the underlying object to the specified device

        Arguments
        ---------
        device : str | torch.device
            the device

        Returns
        -------
        self
        """
        super().to(device)
        if self.instance is None:
            self.instance = self.init()
        if hasattr(self.instance, "to"):
            self.instance = self.instance.to(device)
        return self


def lazy_init(init):
    """A wrapper to ensure that the specified object is initialized
    only once (used mainly for tokenizers that train when the
    constructor is called)

    Arguments
    ---------
    init: callable
        a constructor or function that creates an object

    Returns
    -------
    instance: object
        the object instance
    """
    return LazyInit(init)


def get_sequence_key(key, mode):
    """Determines the key to be used for sequences (e.g. graphemes/phonemes)
    based on the naming convention

    Arguments
    ---------
    key: str
        the key (e.g. "graphemes", "phonemes")
    mode: str
        the mode/suffix (raw, eos/bos)

    Returns
    -------
    key if ``mode=="raw"`` else ``f"{key}_{mode}"``
    """
    return key if mode == "raw" else f"{key}_{mode}"


def phonemes_to_label(phns, decoder):
    """Converts a batch of phoneme sequences (a single tensor)
    to a list of space-separated phoneme label strings,
    (e.g. ["T AY B L", "B UH K"]), removing any special tokens

    Arguments
    ---------
    phns: torch.Tensor
        a batch of phoneme sequences
    decoder: Callable
        Converts tensor to phoneme label strings.

    Returns
    -------
    result: list
        a list of strings corresponding to the phonemes provided
    """
    phn_decoded = decoder(phns)
    return [" ".join(remove_special(item)) for item in phn_decoded]


def remove_special(phn):
    """Removes any special tokens from the sequence. Special tokens are delimited
    by angle brackets.

    Arguments
    ---------
    phn: list
        a list of phoneme labels

    Returns
    -------
    result: list
        the original list, without any special tokens
    """
    return [token for token in phn if "<" not in token]


def word_emb_pipeline(
    txt,
    grapheme_encoded,
    grapheme_encoded_len,
    grapheme_encoder=None,
    word_emb=None,
    use_word_emb=None,
):
    """Applies word embeddings, if applicable. This function is meant
    to be used as part of the encoding pipeline

    Arguments
    ---------
    txt: str
        the raw text
    grapheme_encoded: torch.Tensor
        the encoded graphemes
    grapheme_encoded_len: torch.Tensor
        encoded grapheme lengths
    grapheme_encoder: speechbrain.dataio.encoder.TextEncoder
        the text encoder used for graphemes
    word_emb: callable
        the model that produces word embeddings
    use_word_emb: bool
        a flag indicating whether word embeddings are to be applied

    Returns
    -------
    char_word_emb: torch.Tensor
        Word embeddings, expanded to the character dimension
    """
    char_word_emb = None
    if use_word_emb:
        raw_word_emb = word_emb().embeddings(txt)
        word_separator_idx = grapheme_encoder.lab2ind[" "]
        # Add a batch dimension for expand_to_chars, then remove it again
        char_word_emb = expand_to_chars(
            emb=raw_word_emb.unsqueeze(0),
            seq=grapheme_encoded.unsqueeze(0),
            seq_len=grapheme_encoded_len.unsqueeze(0),
            word_separator=word_separator_idx,
        ).squeeze(0)
    return char_word_emb