o à·i<ã@s&ddlZddlmZddlmZmZmZmZmZm Z ddl Z ddl mZddlm Z ddlmZddlmZmZddlmZmZd d lmZd dlmZgZdZGd d„dejƒZGdd„dejƒZGdd„de jj ej!ƒZ"Gdd„de jj ej!ƒZ#Gdd„dƒZ$Gdd„dƒZ%eGdd„dƒƒZ&eGdd„dƒƒZ'Gdd„dƒZ(eGdd „d e'e&e$eƒƒZ)eGd!d"„d"e'e&e%eƒƒZ*eGd#d$„d$e(e&e$eƒƒZ+eGd%d&„d&e(e&e%eƒƒZ,e+d'ej-d(d)d*Z.d+e._/e,d,ej-d-d)d*Z0d.e0_/e)d/ej-d(d)d0e 1¡d1Z2d2e2_/e*d3ej-d-d)d0e 1¡d1Z3d4e3_/dS)5éN)Ú dataclass)ÚAnyÚDictÚListÚOptionalÚTupleÚUnion)ÚTensor)Úload_state_dict_from_url)Úmu_law_decoding)Ú Tacotron2ÚWaveRNN)Ú GriffinLimÚInverseMelScaleé)Úutils)ÚTacotron2TTSBundlez.https://download.pytorch.org/torchaudio/modelscsNeZdZ‡fdd„Zedd„ƒZdeeeefde e e ffdd„Z‡ZS) Ú_EnglishCharProcessorcs.tƒ ¡t ¡|_dd„t|jƒDƒ|_dS)NcSói|]\}}||“qS©r)Ú.0ÚiÚsrrúT/home/ubuntu/vllm_env/lib/python3.10/site-packages/torchaudio/pipelines/_tts/impl.pyÚ óz2_EnglishCharProcessor.__init__..)ÚsuperÚ__init__rÚ _get_charsÚ_tokensÚ enumerateÚ_mapping©Úself©Ú __class__rrrs z_EnglishCharProcessor.__init__cCó|jS©N©rr"rrrÚtokensóz_EnglishCharProcessor.tokensÚtextsÚreturncs,t|tƒr|g}‡fdd„|Dƒ}t |¡S)Ncs"g|] }‡fdd„| ¡Dƒ‘qS)cs g|]}|ˆjvrˆj|‘qSr©r!)rÚcr"rrÚ &s z=_EnglishCharProcessor.__call__...)Úlower)rÚtr"rrr/&s"z2_EnglishCharProcessor.__call__..)Ú isinstanceÚstrrÚ _to_tensor)r#r+Úindicesrr"rÚ__call__#s z_EnglishCharProcessor.__call__© Ú__name__Ú __module__Ú__qualname__rÚpropertyr)rr3rrr r6Ú __classcell__rrr$rrs .rcsTeZdZddœ‡fdd„ Zedd„ƒZdeeeefde e e ffd d „Z‡ZS)Ú_EnglishPhoneProcessorN©Ú dl_kwargscsDtƒ ¡t ¡|_dd„t|jƒDƒ|_tjd|d|_d|_ dS)NcSrrr)rrÚprrrr.rz3_EnglishPhoneProcessor.__init__..zen_us_cmudict_forward.ptr>z(\[[A-Z]+?\]|[_!'(),.:;? -])) rrrÚ_get_phonesrr r!Ú_load_phonemizerÚ_phonemizerÚ_pattern©r#r?r$rrr+s z_EnglishPhoneProcessor.__init__cCr&r'r(r"rrrr)2r*z_EnglishPhoneProcessor.tokensr+r,csbt|tƒr|g}g}ˆj|ddD]}dd„t ˆj|¡Dƒ}| ‡fdd„|Dƒ¡qt |¡S)NÚen_us)ÚlangcSsg|] }t dd|¡‘qS)z[\[\]]Ú)ÚreÚsub)rÚrrrrr/=sz3_EnglishPhoneProcessor.__call__..csg|]}ˆj|‘qSrr-)rr@r"rrr/>r) r2r3rCrIÚfindallrDÚappendrr4)r#r+r5ÚphonesÚretrr"rr66s z_EnglishPhoneProcessor.__call__r7rrr$rr=*s .r=csBeZdZddedeef‡fdd„ Zedd„ƒZdd d „Z ‡Z S) Ú_WaveRNNVocoderéœÿÿÿÚmodelÚmin_level_dbcs tƒ ¡d|_||_||_dS)Né"V)rrÚ_sample_rateÚ_modelÚ _min_level_db)r#rRrSr$rrrHs z_WaveRNNVocoder.__init__cCr&r'©rUr"rrrÚsample_rateNr*z_WaveRNNVocoder.sample_rateNcCsŽt |¡}dt tj|dd¡}|jdur&|j||j}tj|ddd}|j ||¡\}}t ||jj ¡}t ||jjƒ}| d¡}||fS)Négñhãˆµøä>)Úminrr)r[Úmax) ÚtorchÚexpÚlog10ÚclamprWrVÚinferrÚ_unnormalize_waveformÚn_bitsrÚ n_classesÚsqueeze)r#Úmel_specÚlengthsÚwaveformrrrÚforwardRs z_WaveRNNVocoder.forward)rQr')r8r9r:r rÚfloatrr;rYrir<rrr$rrPGs rPcs2eZdZ‡fdd„Zedd„ƒZddd„Z‡ZS) Ú_GriffinLimVocoderc s@tƒ ¡d|_tdd|jddddd|_tdd d dd|_dS)NrTiéPgg@¿@Úslaney)Ún_stftÚn_melsrYÚf_minÚf_maxÚ mel_scaleÚnormiré)Ún_fftÚpowerÚ hop_lengthÚ win_length)rrrUrrYÚ_inv_melrÚ_griffin_limr"r$rrr`s" ù üz_GriffinLimVocoder.__init__cCr&r'rXr"rrrrYsr*z_GriffinLimVocoder.sample_rateNcCsFt |¡}| ¡ ¡ d¡}| |¡}| ¡ d¡}| |¡}||fS)NTF)r]r^ÚcloneÚdetachÚrequires_grad_ryrz)r#rfrgÚspecÚ waveformsrrrriws z_GriffinLimVocoder.forwardr')r8r9r:rr;rYrir<rrr$rrk_s rkc@seZdZdejfdd„ZdS)Ú _CharMixinr,cCótƒSr')rr"rrrÚget_text_processor†óz_CharMixin.get_text_processorN©r8r9r:rÚ TextProcessorr‚rrrrr€…sr€c@s"eZdZddœdejfdd„ZdS)Ú_PhoneMixinNr>r,cCs t|dS©Nr>)r=rErrrr‚‹s z_PhoneMixin.get_text_processorr„rrrrr†Šsr†c@s:eZdZUeed<eeefed<ddœdefdd„ZdS)Ú_Tacotron2MixinÚ_tacotron2_pathÚ_tacotron2_paramsNr>r,cCóVtdi|j¤Ž}t›d|j›}|durin|}t|fi|¤Ž}| |¡| ¡|S©Nú/r)rrŠÚ _BASE_URLr‰r Úload_state_dictÚeval©r#r?rRÚurlÚ state_dictrrrÚ get_tacotron2”ó z_Tacotron2Mixin.get_tacotron2) r8r9r:r3Ú__annotations__rrrr”rrrrrˆs rˆc@sJeZdZUeeed<eeeefed<ddœdd„Zddœdd„Z dS) Ú _WaveRNNMixinÚ _wavernn_pathÚ_wavernn_paramsNr>cCs|j|d}t|ƒSr‡)Ú_get_wavernnrP)r#r?ÚwavernnrrrÚget_vocoder£sz_WaveRNNMixin.get_vocodercCr‹rŒ)r r™rŽr˜r rrr‘rrrrš§r•z_WaveRNNMixin._get_wavernn) r8r9r:rr3r–rrrœršrrrrr—žs r—c@seZdZdd„ZdS)Ú_GriffinLimMixincKrr')rk)r#Ú_rrrrœ²rƒz_GriffinLimMixin.get_vocoderN)r8r9r:rœrrrrr±src@óeZdZdS)Ú_Tacotron2WaveRNNCharBundleN©r8r9r:rrrrr »ór c@rŸ)Ú_Tacotron2WaveRNNPhoneBundleNr¡rrrrr£Àr¢r£c@rŸ)Ú_Tacotron2GriffinLimCharBundleNr¡rrrrr¤År¢r¤c@rŸ)Ú_Tacotron2GriffinLimPhoneBundleNr¡rrrrr¥Êr¢r¥z5tacotron2_english_characters_1500_epochs_ljspeech.pthé&)Ú n_symbols)r‰rŠaþCharacter-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and :py:class:`~torchaudio.transforms.GriffinLim` as vocoder. The text processor encodes the input texts character-by-character. You can find the training script `here `__. The default parameters were used. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

z3tacotron2_english_phonemes_1500_epochs_ljspeech.pthé`aèPhoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs and :py:class:`~torchaudio.transforms.GriffinLim` as vocoder. The text processor encodes the input texts based on phoneme. It uses `DeepPhonemizer `__ to convert graphemes to phonemes. The model (*en_us_cmudict_forward*) was trained on `CMUDict `__. You can find the training script `here `__. The text processor is set to the *"english_phonemes"*. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

z=tacotron2_english_characters_1500_epochs_wavernn_ljspeech.pthz%wavernn_10k_epochs_8bits_ljspeech.pth)r‰rŠr˜r™aCharacter-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs and :py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs. The text processor encodes the input texts character-by-character. You can find the training script `here `__. The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``, ``mel_fmin=40``, and ``mel_fmax=11025``. You can find the training script `here `__. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_CHAR_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_CHAR_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

z;tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech.pthaPhoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and :py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs. The text processor encodes the input texts based on phoneme. It uses `DeepPhonemizer `__ to convert graphemes to phonemes. The model (*en_us_cmudict_forward*) was trained on `CMUDict `__. You can find the training script for Tacotron2 `here `__. The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``, ``mel_fmin=40``, and ``mel_fmax=11025``. You can find the training script for WaveRNN `here `__. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_PHONE_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_PHONE_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

)4rIÚdataclassesrÚtypingrrrrrrr]r Útorchaudio._internalr Útorchaudio.functionalrÚtorchaudio.modelsrr Útorchaudio.transformsrrrHrÚ interfacerÚ__all__rŽr…rr=ÚnnÚModuleÚVocoderrPrkr€r†rˆr—rr r£r¤r¥Ú_get_taco_paramsÚ"TACOTRON2_GRIFFINLIM_CHAR_LJSPEECHÚ__doc__Ú#TACOTRON2_GRIFFINLIM_PHONE_LJSPEECHÚ_get_wrnn_paramsÚTACOTRON2_WAVERNN_CHAR_LJSPEECHÚ TACOTRON2_WAVERNN_PHONE_LJSPEECHrrrrÚsp & þ# þ( ü% ü