"""Tokenization class for model T5."""

import os
import re
import warnings
from shutil import copyfile
from typing import TYPE_CHECKING, Any, Optional

import sentencepiece as spm

from ...convert_slow_tokenizer import import_protobuf
from ...tokenization_utils import PreTrainedTokenizer
from ...tokenization_utils_base import AddedToken
from ...utils import logging
from ...utils.import_utils import requires


if TYPE_CHECKING:
    from ...tokenization_utils_base import TextInput


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}

SPIECE_UNDERLINE = "▁"


@requires(backends=("sentencepiece",))
class T5Tokenizer(PreTrainedTokenizer):
    """
    Construct a T5 tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
            contains the vocabulary necessary to instantiate a tokenizer.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        extra_ids (`int`, *optional*, defaults to 100):
            Add a number of extra ids to the vocabulary for use as sentinels. These tokens are accessible as
            "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1. The tokens can be retrieved by
            calling the `get_sentinel_tokens` method, and their token ids by calling the `get_sentinel_token_ids`
            method.
        additional_special_tokens (`list[str]`, *optional*):
            Additional special tokens used by the tokenizer.
        sp_model_kwargs (`dict`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
            to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

              - `nbest_size = {0,1}`: No sampling is performed.
              - `nbest_size > 1`: samples from the nbest_size results.
              - `nbest_size < 0`: assumes that nbest_size is infinite and samples from all hypotheses (lattice)
                using the forward-filtering-and-backward-sampling algorithm.

            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
              BPE-dropout.
        legacy (`bool`, *optional*):
            Whether or not the `legacy` behaviour of the tokenizer should be used. Legacy refers to the behaviour
            before the merge of #24622 and #25224, which include fixes to properly handle tokens that appear after
            special tokens. A simple example:

            - `legacy=True`:
            ```python
            >>> from transformers import T5Tokenizer

            >>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base", legacy=True)
            >>> tokenizer.encode("Hello <extra_id_0>.")
            [8774, 32099, 3, 5, 1]
            ```
            - `legacy=False`:
            ```python
            >>> from transformers import T5Tokenizer

            >>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base", legacy=False)
            >>> tokenizer.encode("Hello <extra_id_0>.")  # the extra space `[3]` is no longer here
            [8774, 32099, 5, 1]
            ```
            Check out the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details.
        add_prefix_space (`bool`, *optional*, defaults to `True`):
            Whether or not to add an initial space to the input. This allows the leading word to be treated just like
            any other word.

    Attributes:
        sp_model (`SentencePieceProcessor`):
            The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
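
    Example (a minimal usage sketch; `google-t5/t5-small` is a public checkpoint, and loading it requires a local
    copy or access to the Hugging Face Hub):

    ```python
    >>> from transformers import T5Tokenizer

    >>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
    >>> tokens = tokenizer.tokenize("Translate English to German: Hello.")
    >>> ids = tokenizer.encode("Translate English to German: Hello.")  # appends the `</s>` (eos) id
    >>> sentinels = tokenizer.get_sentinel_tokens()  # the "<extra_id_*>" sentinel tokens described above
    ```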
    	input_idsattention_mask</s><unk><pad>d   NTsp_model_kwargsreturnc
                    s  t |trt|ddn|}t |trt|ddn|}t |tr%t|ddn|}|d u r-i n|| _|| _|| _tjdi | j| _| j	| |d urydd |D }t
|dk rc|dd t|D 7 }n!|dkrx|t
|krxtd| d	| d
ndd t|D }|}i | _tt
|D ]}td| ddddddd| jt
| jd | | < q|d u rtd| j d d}|| _| |
dd| _|	| _t jd|||||| j||	d|
 d S )NT)specialc                 S   s   g | ]
}d t |v r|qS )
<extra_id_)str).0x r   \/home/ubuntu/vllm_env/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5.py
<listcomp>   s    z(T5Tokenizer.__init__.<locals>.<listcomp>   c                 S      g | ]}d | dqS r   >r   r   ir   r   r   r           r   zBoth extra_ids (z!) and additional_special_tokens (zk) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokensc                 S   r"   r#   r   r%   r   r   r   r       r'   r   r$   F)single_wordlstriprstripr   
normalizedz2You are using the default legacy behaviour of the a_  . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565	from_slow)	eos_token	unk_token	pad_token	extra_idsadditional_special_tokensr   legacyadd_prefix_spacer   )
isinstancer   r	   r   r   
_extra_idsspmSentencePieceProcessorsp_modelLoadlenrange
ValueError_added_tokens_decoderloggerwarning_once	__class__r2   get_spm_processorpopr3   super__init__)selfr   r-   r.   r/   r0   r1   r   r2   r3   kwargsextra_tokensr&   r@   r   r   rD      sX    	
zT5Tokenizer.__init__Fc                 C   s   t jdi | j}| js|r|| j |S t| jd3}| }td| j	j
 d}|j|}| }d|_|j| | }|| W d    |S 1 sRw   Y  |S )NrbzThe new behaviour of z (with `self.legacy = False`)Fr   )r6   r7   r   r2   r9   r   openreadr   r@   __name__
ModelProto
FromStringNormalizerSpecadd_dummy_prefixnormalizer_spec	MergeFromSerializeToStringLoadFromSerializedProto)rE   r,   	tokenizerfr8   	model_pb2modelrQ   r   r   r   rA      s"   

		zT5Tokenizer.get_spm_processorc                 C   sZ   | t jv r+t j|  }|d ur||kr|S |d u r+td| d|  d| d| d	t |S )NzGThis tokenizer was incorrectly instantiated with a model max length of z which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on z( automatically truncating your input to zM when padding/encoding.
- If you want to encode/pad to sequences longer than z you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.)r   max_model_input_sizeswarningswarnFutureWarning)pretrained_model_name_or_pathmax_model_lengthinit_max_model_lengthdeprecated_max_model_lengthr   r   r   !_eventually_correct_t5_max_length   s$   

	z-T5Tokenizer._eventually_correct_t5_max_lengthc                 C   s
   | j  S N)r8   get_piece_sizerE   r   r   r   
vocab_size   s   
zT5Tokenizer.vocab_sizec                    s(    fddt  jD }| j |S )Nc                    s   i | ]}  ||qS r   )convert_ids_to_tokensr%   rd   r   r   
<dictcomp>   r'   z)T5Tokenizer.get_vocab.<locals>.<dictcomp>)r;   re   updateadded_tokens_encoder)rE   vocabr   rd   r   	get_vocab   s   zT5Tokenizer.get_vocabtoken_ids_0token_ids_1already_has_special_tokensc                    sZ   |rt  j||ddS |du rdgt| dg S dgt| dg dgt|  dg S )a  
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`list[int]`):
                List of IDs.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `list[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
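
        Example (an illustrative sketch; the token ids `10`, `20`, `30` are arbitrary placeholders):

        ```python
        >>> tokenizer.get_special_tokens_mask([10, 20], [30])  # one trailing eos (`</s>`) per sequence
        [0, 0, 1, 0, 1]
        ```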
        T)rl   rm   rn   Nr   r!   )rC   get_special_tokens_maskr:   )rE   rl   rm   rn   rH   r   r   ro      s   (z#T5Tokenizer.get_special_tokens_maskc                 C   s   t ttdd | jS )Nc                 S   s   t td| d uS )Nz<extra_id_\d+>)boolresearch)r   r   r   r   <lambda>  s    z1T5Tokenizer.get_sentinel_tokens.<locals>.<lambda>)listsetfilterr1   rd   r   r   r   get_sentinel_tokens  s   zT5Tokenizer.get_sentinel_tokensc                    s    fdd   D S )Nc                    s   g | ]}  |qS r   )convert_tokens_to_ids)r   tokenrd   r   r   r      s    z6T5Tokenizer.get_sentinel_token_ids.<locals>.<listcomp>)rw   rd   r   rd   r   get_sentinel_token_ids  s   z"T5Tokenizer.get_sentinel_token_ids	token_idsc                 C   s>   t |dkr|d | jkrtd| j d |S || jg S )z.Do not add eos again if user already added it.r   zThis sequence already has zQ. In future versions this behavior may lead to duplicated eos tokens being added.)r:   eos_token_idrZ   r[   r-   )rE   r{   r   r   r   _add_eos_if_not_present  s   z#T5Tokenizer._add_eos_if_not_presentc                 C   s<   | j g}|du rt|| dg S t|| | | dg S )a  
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make
        use of token type ids, therefore a list of zeros is returned.

        Args:
            token_ids_0 (`list[int]`):
                List of IDs.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `list[int]`: List of zeros.
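
        Example (an illustrative sketch; the token ids are arbitrary placeholders):

        ```python
        >>> tokenizer.create_token_type_ids_from_sequences([10, 20], [30])  # one zero per token, eos included
        [0, 0, 0, 0, 0]
        ```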
        """
        eos = [self.eos_token_id]

        if token_ids_1 is None:
            return len(token_ids_0 + eos) * [0]
        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]

    def build_inputs_with_special_tokens(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
    ) -> list[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
        and adding special tokens. A sequence has the following format:

        - single sequence: `X </s>`
        - pair of sequences: `A </s> B </s>`

        Args:
            token_ids_0 (`list[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `list[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
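
        Example (an illustrative sketch; the ids `10`, `20`, `30` are placeholders, while `1` is the eos id used by
        the standard T5 vocabularies):

        ```python
        >>> tokenizer.build_inputs_with_special_tokens([10, 20], [30])
        [10, 20, 1, 30, 1]
        ```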
        N)r~   )rE   rl   rm   r   r   r    build_inputs_with_special_tokens=  s
   

z,T5Tokenizer.build_inputs_with_special_tokensc                 C   s   | j  }d |d< |S )Nr8   )__dict__copy)rE   stater   r   r   __getstate__W  s   
zT5Tokenizer.__getstate__c                 C   s<   || _ t| dsi | _tjdi | j| _| j| j d S )Nr   r   )r   hasattrr   r6   r7   r8   r9   r   )rE   dr   r   r   __setstate__\  s
   
zT5Tokenizer.__setstate__textr
   c                    s   | j s	t|dkrt j|fi |S |td}| jr t| }t j|fi |}t|dkrC|d tkrC|d | jv rC|dd }|S )z
        Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
        first token is special.
        """
        if self.legacy or len(text) == 0:
            return super().tokenize(text, **kwargs)

        text = text.replace(SPIECE_UNDERLINE, " ")
        if self.add_prefix_space:
            text = SPIECE_UNDERLINE + text

        tokens = super().tokenize(text, **kwargs)

        if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
            tokens = tokens[1:]
        return tokens

    @property
    def unk_token_length(self):
        return len(self.sp_model.encode(str(self.unk_token)))

    def _tokenize(self, text, **kwargs):
        """
        Returns a tokenized string.

        We deactivated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
        SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type=str)` will give
        `['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
        `unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`:
        `self.tokenizer.sp_model.encode("<unk> Hey", out_type=str)[4:]`.
        """
        if self.legacy or not text.startswith((SPIECE_UNDERLINE, " ")):
            return self.sp_model.encode(text, out_type=str)

        # 1. Encode string + prefix, e.g. "<unk> Hey"
        tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
        # 2. Remove self.unk_token from the result, e.g. ['<', 'unk', '>', '▁Hey'] -> ['▁Hey']
        return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        if tokens[0].startswith(SPIECE_UNDERLINE) and self.add_prefix_space:
            tokens[0] = tokens[0][1:]

        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for token in tokens:
            # make sure that special tokens are not decoded using the sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string.strip()

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)


__all__ = ["T5Tokenizer"]