o ¾e¦i,ã@s°UdZddlmZmZddlmZe e¡ZdZ dZ dZdZdZ d Zd Zedede d ede dediZeeefed<dd„e ¡DƒZeeefed<Gdd„deƒZdgZdS)z Tokenization classes for CANINE.é)Ú AddedTokenÚPreTrainedTokenizer)Úloggingiéiàiàiàiàiàz[CLS]z[SEP]z[BOS]z[MASK]z[PAD]z [RESERVED]ÚSPECIAL_CODEPOINTScCói|]\}}||“qS©r)Ú.0Ú codepointÚnamerrúl/home/ubuntu/transcripts/venv/lib/python3.10/site-packages/transformers/models/canine/tokenization_canine.pyÚ 4ór ÚSPECIAL_CODEPOINTS_BY_NAMEcs®eZdZdZgd¢Zeeƒeeƒeeƒeeƒeeƒee ƒddf‡fdd„ Z edefdd „ƒZ d d„Zdedeefd d„Zdedefdd„Zdedefdd„Zdd„Z‡ZS)ÚCanineTokenizeraé Construct a CANINE tokenizer (i.e. a character splitter). It turns text into a sequence of characters, and then converts each character into its Unicode code point. [`CanineTokenizer`] inherits from [`PreTrainedTokenizer`]. Refer to superclass [`PreTrainedTokenizer`] for usage examples and documentation concerning parameters. Args: model_max_length (`int`, *optional*, defaults to 2048): The maximum sentence length the model accepts. )Ú input_idsÚattention_maskÚtoken_type_idsFic s t|tƒrt|dddn|}t|tƒrt|dddn|}t|tƒr(t|dddn|}t|tƒr6t|dddn|}t|tƒrDt|dddn|}t|tƒrRt|dddn|}i|_t ¡D] \} }| |j|<q[dd„|j ¡Dƒ|_t|_t |jƒ|_ tƒjd ||||||||ddddœ| ¤ŽdS) NF)ÚlstripÚrstripTcSrrr)r rr rrrr bs ÿz,CanineTokenizer.__init__..Ú all_zerosÚcls_sep)Ú bos_tokenÚ eos_tokenÚ sep_tokenÚ cls_tokenÚ pad_tokenÚ mask_tokenÚadd_prefix_spaceÚmodel_max_lengthÚtoken_type_ids_patternÚ%token_type_ids_include_special_tokensÚspecial_tokens_patternr) Ú isinstanceÚstrrÚ_special_codepointsrÚitemsÚ_special_codepoint_stringsÚUNICODE_VOCAB_SIZEÚ_unicode_vocab_sizeÚlenÚ_num_special_tokensÚsuperÚ__init__)ÚselfrrrrrrrrÚkwargsr r©Ú __class__rrr-Gs:ÿõ ôzCanineTokenizer.__init__ÚreturncCs|jS)N)r))r.rrrÚ vocab_sizexszCanineTokenizer.vocab_sizecCs$dd„t|jƒDƒ}| |j¡|S)NcSsi|]}t|ƒ|“qSr)Úchr)r Úirrrr }rz-CanineTokenizer.get_vocab..)Úranger3ÚupdateÚadded_tokens_encoder)r.ÚvocabrrrÚ get_vocab|szCanineTokenizer.get_vocabÚtextcCst|ƒS)z5Tokenize a string (i.e. perform character splitting).)Úlist)r.r;rrrÚ _tokenizeszCanineTokenizer._tokenizeÚtokencCs*zt|ƒWStytd|›dƒ‚w)zaConverts a token (i.e. a Unicode character) in an id (i.e. its integer Unicode code point value).zinvalid token: 'ú')ÚordÚ TypeErrorÚ ValueError)r.r>rrrÚ_convert_token_to_id…s ÿz$CanineTokenizer._convert_token_to_idÚindexcCs:z|tvr t|WSt|ƒWStytd|›ƒ‚w)z˜ Converts a Unicode code point (integer) in a token (str). In case it's a special code point, convert to human-readable format. zinvalid id: )rr4rArB)r.rDrrrÚ_convert_id_to_tokenŒs ÿz$CanineTokenizer._convert_id_to_tokencCs d |¡S)NÚ)Újoin)r.ÚtokensrrrÚconvert_tokens_to_string˜s z(CanineTokenizer.convert_tokens_to_string)Ú__name__Ú __module__Ú__qualname__Ú__doc__Úmodel_input_namesr4ÚCLSÚSEPÚPADÚMASKr-ÚpropertyÚintr3r:r$r<r=rCrErIÚ __classcell__rrr0rr7s& ÷1rN)rMÚtokenization_pythonrrÚutilsrÚ get_loggerrJÚloggerr(rQrOrPÚBOSrRÚRESERVEDrÚdictrTr$Ú__annotations__r&rrÚ__all__rrrrÚs* ô" e