o ÎÎ¯i-ã @s`dZddlZddlZddlZddlZddlZddlmZmZddl m Z mZmZm Z ddlZddlZddlZdejd<dZeƒdd „ƒZeƒd d„ƒZdd „Zdd„Zdd„Zdd„Zdd„Zdd„Zdefdd„Zddœdd„ZGdd„deƒZ d-d!e eeefd"e!d#e!d$e!d%e d&e"fd'd(„Z#d!e eeefd"e!d#e!d$e!d%e f d)d*„Z$defd+d,„Z%dS).zp CLIP tokenizer Copied from https://github.com/openai/CLIP. Originally MIT License, Copyright (c) 2021 OpenAI. éN)Ú lru_cacheÚpartial)ÚCallableÚListÚOptionalÚUnionÚfalseÚTOKENIZERS_PARALLELISMéMcCstj tj tj t¡¡d¡S)Nzbpe_simple_vocab_16e6.txt.gz)ÚosÚpathÚjoinÚdirnameÚabspathÚ__file__©rrúQ/home/ubuntu/.local/lib/python3.10/site-packages/core/vision_encoder/tokenizer.pyÚdefault_bpesÿrcCs°tttdƒtdƒdƒƒtttdƒtdƒdƒƒtttdƒtdƒdƒƒ}|dd…}d }td ƒD]}||vrI| |¡| d |¡|d7}q3dd„|Dƒ}tt||ƒƒS) a: Returns list of utf-8 byte and a corresponding list of unicode strings. The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. And avoids mapping to whitespace/control characters the bpe code barfs on. ú!ú~éõÂ¡õÂ¬õÂ®õÃ¿NrécSsg|]}t|ƒ‘qSr)Úchr)Ú.0ÚnrrrÚ 6óz$bytes_to_unicode..)ÚlistÚrangeÚordÚappendÚdictÚzip)ÚbsÚcsrÚbrrrÚbytes_to_unicodes ÿþÿ €r*cCs6tƒ}|d}|dd…D]}| ||f¡|}q |S)zReturn set of symbol pairs in a word. Word is represented as tuple of symbols (symbols being variable-length strings). rrN)ÚsetÚadd)ÚwordÚpairsÚ prev_charÚcharrrrÚ get_pairs:sr1cCs"t |¡}t t |¡¡}| ¡S©N)ÚftfyÚfix_textÚhtmlÚunescapeÚstrip©ÚtextrrrÚbasic_cleanFs r:cCst dd|¡}| ¡}|S)Nú\s+ú )ÚreÚsubr7r8rrrÚwhitespace_cleanLsr?cCótt|ƒƒSr2)Úcanonicalize_textr:©ÚxrrrÚ_clean_canonicalizeRórDcCstt|ƒƒ ¡Sr2)r?r:ÚlowerrBrrrÚ_clean_lowerWsrGcCr@r2)r?r:rBrrrÚ_clean_whitespace\rErHÚtypecCs4|dkrtS|dkrtS|dkrtSJd|›dƒ‚)NÚcanonicalizerFÚ whitespaceFzInvalid clean function (z).)rDrGrH©rIrrrÚget_clean_fnasrM)Úkeep_punctuation_exact_stringcCs`| dd¡}|r| dd„| |¡Dƒ¡}n| t ddtj¡¡}| ¡}t dd|¡}| ¡S)aøReturns canonicalized `text` (lowercase and punctuation removed). From: https://github.com/google-research/big_vision/blob/53f18caf27a9419231bbf08d3388b07671616d3d/big_vision/evaluators/proj/image_text/prompt_engineering.py#L94 Args: text: string to be canonicalized. keep_punctuation_exact_string: If provided, then this exact string kept. For example providing '{}' will keep any occurrences of '{}' (but will still remove '{' and '}' that appear separately). Ú_r<css&|]}| t ddtj¡¡VqdS)ÚN)Ú translateÚstrÚ maketransÚstringÚpunctuation)rÚpartrrrÚ ys €ÿ ÿz$canonicalize_text..rPr;)Úreplacer ÚsplitrQrRrSrTrUrFr=r>r7)r9rNrrrrAls þrAc@sˆeZdZeƒdeddfdedeeedeededef d d „Z dd„Z d d„Zdd„Z dde eeefdeedejfdd„ZdS)ÚSimpleTokenizerNrFrPÚbpe_pathÚadditional_special_tokensÚcontext_lengthÚcleanÚreduction_maskcs‚tƒˆ_dd„ˆj ¡Dƒˆ_t |¡ ¡ d¡ d¡}|dd…}dd„|Dƒ}t tƒ ¡ƒ}|d d„|Dƒ}|D] }| d |¡¡q;ddg} |rP| |7} | | ¡tt|tt|ƒƒƒƒˆ_d d„ˆj ¡Dƒˆ_tt|tt|ƒƒƒƒˆ_dd„| Dƒˆ_d | ¡} t | dtj¡ˆ_tˆjƒˆ_‡fdd„| Dƒˆ_ˆjdˆ_ˆjdˆ_|ˆ_t|ƒˆ_ |r¼t!|ƒˆ_"dSdˆ_"dS)NcSói|]\}}||“qSrr©rÚkÚvrrrÚ Žóz,SimpleTokenizer.__init__..úutf-8Ú riÿ¾cSsg|]}t| ¡ƒ‘qSr)ÚtuplerY)rÚmergerrrr‘sz,SimpleTokenizer.__init__..cSsg|]}|d‘qS)úr)rrcrrrr“r rPzz cSr`rrrarrrrd›recSsi|]}||“qSrr©rÚtrrrrdóú|z:|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+cóg|]}ˆj|‘qSr©Úencoderrk©Úselfrrr¤rer)#r*Úbyte_encoderÚitemsÚbyte_decoderÚgzipÚopenÚreadÚdecoderYr!Úvaluesr$r Úextendr%r&r"ÚlenrqÚdecoderÚ bpe_ranksÚcacher=ÚcompileÚ IGNORECASEÚpatÚ vocab_sizeÚall_special_idsÚsot_token_idÚeot_token_idr]rMÚclean_fnÚget_reduction_mask_fnÚreduction_fn)rsr[r\r]r^r_ÚmergesÚvocabriÚspecial_tokensÚspecialrrrrÚ__init__…s@ þ ÿÿzSimpleTokenizer.__init__c sj|ˆjvr ˆj|St|dd…ƒ|ddf}t|ƒ}|s#|dS t|‡fdd„d}|ˆjvr4nu|\}}g}d}|t|ƒkr—z| ||¡} | ||| …¡| }Wn| ||d…¡Yn3|||kr†|t|ƒdkr†||d|kr†| ||¡|d 7}n| ||¡|d7}|t|ƒksBt|ƒ}|}t|ƒdkr¤nt|ƒ}q$d |¡}|ˆj|<|S)NéÿÿÿÿrjTcsˆj |tdƒ¡S)NÚinf)rÚgetÚfloat)ÚpairrrrrÚ·rmz%SimpleTokenizer.bpe..)Úkeyrrér<) r€rhr1Úminrr}Úindexr|r$r ) rsÚtokenr-r.ÚbigramÚfirstÚsecondÚnew_wordÚiÚjrrrrÚbpesH , òå zSimpleTokenizer.bpecshg}ˆ |¡}t ˆj|¡D]#}d ‡fdd„| d¡Dƒ¡}| ‡fdd„ˆ |¡ d¡Dƒ¡q|S)NrPc3ó|]}ˆj|VqdSr2)rt)rr)rrrrrWÚs€z)SimpleTokenizer.encode..rfc3r¢r2rp)rÚ bpe_tokenrrrrrWÛs€ ÿr<) rˆr=Úfindallrƒr Úencoder|r¡rY)rsr9Ú bpe_tokensršrrrrr¥Ös ÿzSimpleTokenizer.encodecsDd ‡fdd„|Dƒ¡}t‡fdd„|Dƒƒjddd dd ¡}|S) NrPcror)r~)rršrrrrrárez*SimpleTokenizer.decode..cror)rv)rÚcrrrrrãrerfrX)Úerrorsrjr<)r Ú bytearrayrzrX)rsÚtokensr9rrrrrzàsÿýzSimpleTokenizer.decodeÚtextsÚreturncsÄt|tƒr|g}|pˆj}|sJdƒ‚ˆjdur%ˆj||ˆjˆjˆjdS‡fdd„|Dƒ}tjt |ƒ|tj d}t|ƒD]"\}}t |ƒ|krR|d|…}ˆj|d<t |¡||dt |ƒ…f<q=|S)aÜReturns the tokenized representation of given input string(s) Parameters ---------- texts : Union[str, List[str]] An input string or a list of input strings to tokenize context_length : int The context length to use; all CLIP models use 77 as the context length Returns ------- A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length] z!Please set a valid context lengthN)r]r†r‡Ú encode_fncs&g|]}ˆjgˆ |¡ˆjg‘qSr)r†r¥r‡©rr9rrrrr sÿÿz,SimpleTokenizer.__call__..©Údtyper) Ú isinstancerRr]rŠr†r‡r¥ÚtorchÚzerosr}ÚlongÚ enumerateÚtensor)rsr«r]Ú all_tokensÚresultrŸrªrrrrÚ__call__és, û þ zSimpleTokenizer.__call__r2)Ú__name__Ú __module__Ú__qualname__rÚDEFAULT_CONTEXT_LENGTHrRrrÚintrr¡r¥rzrr²Ú LongTensorr¹rrrrrZ„s8úþ ýüû ú() ÿÿÿþrZFr«r]r†r‡rÚshufflec sÆ‡fdd„|Dƒ}tjt|ƒ|tjd}t|ƒD]H\}} t | ¡} t| ƒ} | |dkrH|d}t t| ƒ¡}|d|…}|sB| ¡}| |} |} |||df<| ||d| d…f<|||| df<q|S)Ncóg|]}ˆ|ƒ‘qSrrr®©rrrr r z(random_mask_tokenize..r¯r—rr)r²r³r}r´rµr¶ÚrandpermÚmsort) r«r]r†r‡rrÀr·r¸rŸrªÚ num_tokensÚnum_keepÚindicesrrÂrÚrandom_mask_tokenizes" rÈcs¤‡fdd„|Dƒ}tjt|ƒ|tjd}t|ƒD]7\}}t|ƒ} | |dkr:|d} t d| | ¡}|||| …}|g||g}t |¡||dt|ƒ…f<q|S)NcrÁrrr®rÂrrr<r z(simple_mask_tokenize..r¯r—r)r²r³r}r´rµÚrandomÚrandintr¶)r«r]r†r‡rr·r¸rŸrªrÅrÆÚstart_indexrrÂrÚsimple_mask_tokenize5srÌcCs<|dvsJ‚|dkrtS|dkrtS|dkrttddSdS)zNChoose strategy for dropping (masking) tokens to achieve target context length)ÚsimplerÉrÀrÍrÉrÀT)rÀN)rÌrÈrrLrrrr‰Lsÿÿr‰)F)&Ú__doc__rwr5rrÉrTÚ functoolsrrÚtypingrrrrr3Úregexr=r²Úenvironr½rr*r1r:r?rDrGrHrRrMrAÚobjectrZr¾ÚboolrÈrÌr‰rrrrÚsj úÿþýüû úÿþýü û