"""PyTorch Bamba model."""

from typing import Optional, TypedDict, Union

import torch
from torch import nn

from transformers.activations import ACT2FN
from transformers.models.jamba.modeling_jamba import (
    HybridMambaAttentionDynamicCache,
    JambaAttentionDecoderLayer,
)
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaForCausalLM,
    LlamaMLP,
    LlamaRMSNorm,
    LlamaRotaryEmbedding,
    rotate_half,
)
from transformers.models.mamba2.modeling_mamba2 import (
    MambaRMSNormGated,
    pad_tensor_by_size,
    reshape_into_chunks,
    segment_sum,
)

from ...modeling_attn_mask_utils import AttentionMaskConverter
from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
from ...modeling_utils import PreTrainedModel
from ...processing_utils import Unpack
from ...utils import auto_docstring, can_return_tuple, logging
from ...utils.deprecation import deprecate_kwarg
from ...utils.import_utils import is_causal_conv1d_available, is_mamba_2_ssm_available
from .configuration_bamba import BambaConfig


# The fused kernels are optional: without them the mixer falls back to the slower
# pure-PyTorch implementation in `BambaMixer.torch_forward`.
if is_mamba_2_ssm_available():
    from mamba_ssm.ops.triton.selective_state_update import selective_state_update
    from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined, mamba_split_conv1d_scan_combined
else:
    selective_state_update = None

if is_causal_conv1d_available():
    from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
else:
    causal_conv1d_fn, causal_conv1d_update = None, None

is_fast_path_available = all((selective_state_update, causal_conv1d_fn, causal_conv1d_update))


logger = logging.get_logger(__name__)


class BambaFlashAttentionKwargs(TypedDict, total=False):
    """
    Keyword arguments for advanced Flash Attention, causal-conv1d, and mamba_ssm kernel usage.
    Use cases include padding-free training and fewer `torch.compile` graph breaks.

    Attributes:
        cu_seq_lens_q (`torch.LongTensor`):
            Cumulative sequence lengths for the query state.
        cu_seq_lens_k (`torch.LongTensor`):
            Cumulative sequence lengths for the key state.
        max_length_q (`int`):
            Maximum sequence length for query state.
        max_length_k (`int`):
            Maximum sequence length for key state.
        seq_idx (`torch.IntTensor`):
            Index of each packed sequence.
    """

    cu_seq_lens_q: torch.LongTensor
    cu_seq_lens_k: torch.LongTensor
    max_length_q: int
    max_length_k: int
    seq_idx: torch.IntTensor


class HybridMambaAttentionDynamicCache(HybridMambaAttentionDynamicCache):
    """
    A dynamic cache that can handle both the attention cache (which has a seq_len dimension) and the mamba cache
    (which has a constant shape regardless of seq_len).

    This cache has two sets of lists of tensors: `key_cache` and `value_cache` for attention cache and `conv_states`
    and `ssm_states` for mamba cache. Each of these lists has `num_layers` tensors, with the following expected shapes.
    For attention layers, `key_cache` and `value_cache` have a shape of `(batch_size, num_heads, seq_len, head_dim)`,
    while `conv_states` and `ssm_states` have a shape of `(batch_size, 0)` (empty tensors).
    For mamba layers, `key_cache` and `value_cache` have a shape of `(batch_size, 0)` (empty tensors),
    while `conv_states` represents the convolution state and has a shape of `(batch_size, d_inner, d_conv)`,
    and `ssm_states` represents the ssm state and has a shape of `(batch_size, d_inner, d_state)`.
    """

    def __init__(self, config: BambaConfig, batch_size, dtype=torch.float16, device=None):
        self.layers_block_type = config.layers_block_type
        self.has_previous_state = False

        conv_kernel_size = config.mamba_d_conv
        ssm_state_size = config.mamba_d_state

        # Mamba layers get pre-allocated conv/ssm state buffers; attention layers get empty
        # placeholders here, since their states live in `key_cache` / `value_cache` below.
        self.conv_states = []
        self.ssm_states = []
        self.transformer_layers = []
        for i in range(config.num_hidden_layers):
            if self.layers_block_type[i] == "mamba":
                self.conv_states += [
                    torch.zeros(
                        batch_size,
                        (config.mamba_expand * config.hidden_size + 2 * config.mamba_n_groups * ssm_state_size),
                        conv_kernel_size,
                        device=device,
                        dtype=dtype,
                    )
                ]
                self.ssm_states += [
                    torch.zeros(
                        batch_size,
                        config.mamba_n_heads,
                        config.mamba_d_head,
                        ssm_state_size,
                        device=device,
                        dtype=dtype,
                    )
                ]
            else:
                self.conv_states += [torch.tensor([[]] * batch_size, device=device)]
                self.ssm_states += [torch.tensor([[]] * batch_size, device=device)]
                self.transformer_layers.append(i)

        self.key_cache = [torch.tensor([[]] * batch_size, device=device) for _ in range(config.num_hidden_layers)]
        self.value_cache = [torch.tensor([[]] * batch_size, device=device) for _ in range(config.num_hidden_layers)]


class BambaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Removes the interleaving of cos and sin from GLM

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    # Only the first `rotary_dim` channels are rotated; the remaining channels pass through unchanged.
    rotary_dim = cos.shape[-1]
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]

    q_embed = (q_rot * cos) + (rotate_half(q_rot) * sin)
    k_embed = (k_rot * cos) + (rotate_half(k_rot) * sin)

    q_embed = torch.cat([q_embed, q_pass], dim=-1)
    k_embed = torch.cat([k_embed, k_pass], dim=-1)
    return q_embed, k_embed


class BambaAttention(LlamaAttention):
    pass


class BambaRMSNormGated(MambaRMSNormGated):
    pass


def apply_mask_to_padding_states(hidden_states, attention_mask):
    """
    Tunes out the hidden states for padding tokens, see https://github.com/state-spaces/mamba/issues/66
    """
    if attention_mask is not None and attention_mask.shape[1] > 1 and attention_mask.shape[0] > 1:
        dtype = hidden_states.dtype
        hidden_states = (hidden_states * attention_mask[:, :, None]).to(dtype)

    return hidden_states


class BambaMixer(nn.Module):
    """
    Compute ∆, A, B, C, and D the state space parameters and compute the `contextualized_states`.
    A, D are input independent (see Mamba paper [1] Section 3.5.2 "Interpretation of A" for why A isn't selective)
    ∆, B, C are input-dependent (this is a key difference between Mamba and the linear time invariant S4,
    and is why Mamba is called **selective** state spaces)
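    Concretely, each SSM head applies the discretized selective recurrence (roughly)
    `h_t = exp(∆_t · A) · h_{t-1} + ∆_t · B_t · x_t` and `y_t = C_t · h_t + D · x_t`,
    where ∆_t, B_t and C_t are projected from the current input while A and D are learned
    per-head constants.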

    There are a few differences between this and Mamba2Mixer:
    - The variable use_precomputed_states is slightly different due to the hybrid cache structure
    - There are a few non-obvious batching bugs in the slow path that exist on main and are fixed here
    - Some extra variables that our layer doesn't need have been removed
    - We ported most of the refactors in https://github.com/huggingface/transformers/pull/35154, which is (as of Dec 18, 2024) unmerged
    """

    def __init__(self, config: BambaConfig, layer_idx: int):
        super().__init__()
        self.num_heads = config.mamba_n_heads
        self.hidden_size = config.hidden_size
        self.ssm_state_size = config.mamba_d_state
        self.conv_kernel_size = config.mamba_d_conv
        self.intermediate_size = int(config.mamba_expand * self.hidden_size)
        self.layer_idx = layer_idx
        self.use_conv_bias = config.mamba_conv_bias
        self.activation = config.hidden_act
        self.act = ACT2FN[config.hidden_act]
        self.use_bias = config.mamba_proj_bias

        self.layer_norm_epsilon = config.rms_norm_eps

        self.n_groups = config.mamba_n_groups
        self.head_dim = config.mamba_d_head
        self.chunk_size = config.mamba_chunk_size

        # dt (∆) clamping range used by the fused kernels
        self.time_step_limit = (0.0, float("inf"))
        self.time_step_min = 0.001
        self.time_step_max = 0.1

        # depthwise causal convolution over the concatenated (x, B, C) channels
        self.conv_dim = self.intermediate_size + 2 * self.n_groups * self.ssm_state_size
        self.conv1d = nn.Conv1d(
            in_channels=self.conv_dim,
            out_channels=self.conv_dim,
            bias=config.mamba_conv_bias,
            kernel_size=config.mamba_d_conv,
            groups=self.conv_dim,
            padding=config.mamba_d_conv - 1,
        )

        # projection of the hidden states into gate (z), conv input (x, B, C) and dt
        projection_size = self.intermediate_size + self.conv_dim + self.num_heads
        self.in_proj = nn.Linear(self.hidden_size, projection_size, bias=self.use_bias)

        # selective-SSM parameters
        self.dt_bias = nn.Parameter(torch.ones(self.num_heads))
        A = torch.arange(1, self.num_heads + 1)
        self.A_log = nn.Parameter(torch.log(A))
        self.norm = BambaRMSNormGated(self.intermediate_size, eps=self.layer_norm_epsilon)
        self.D = nn.Parameter(torch.ones(self.num_heads))
        self.out_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=self.use_bias)

        if not is_fast_path_available:
            logger.warning_once(
                "The fast path is not available because one of `(selective_state_update, causal_conv1d_fn,"
                " causal_conv1d_update)` is None. Falling back to the naive implementation. To install follow"
                " https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d"
            )
        else:
            logger.warning_once("The fast path for Bamba will be used when running the model on a GPU")

    def cuda_kernels_forward(
        self,
        hidden_states: torch.Tensor,
        cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        seq_idx: Optional[torch.IntTensor] = None,
    ):
        # Fast path built on the fused kernels: single-token decoding uses `causal_conv1d_update`
        # plus `selective_state_update` against the layer's cached conv/ssm states, while prefill
        # uses `causal_conv1d_fn` (or the fully fused `mamba_split_conv1d_scan_combined` during
        # training) followed by `mamba_chunk_scan_combined`, gated RMSNorm and `out_proj`.
        # The full body is not reproduced here; see `transformers/models/bamba/modeling_bamba.py`.
        ...

    def torch_forward(
        self,
        input_states,
        cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
    ):
        # Pure-PyTorch (slow) path: gated `in_proj` split, depthwise causal convolution, chunked
        # selective scan over `self.chunk_size` blocks (using `segment_sum`, `reshape_into_chunks`
        # and `pad_tensor_by_size`), gated RMSNorm and `out_proj`, updating the cache in place.
        # The full body is not reproduced here; see `transformers/models/bamba/modeling_bamba.py`.
        ...

    def forward(
        self,
        hidden_states,
        cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        seq_idx: Optional[torch.IntTensor] = None,
        **kwargs,
    ):
        if is_fast_path_available and "cuda" in self.in_proj.weight.device.type:
            return self.cuda_kernels_forward(hidden_states, cache_params, cache_position, attention_mask, seq_idx)
        if seq_idx is not None:
            raise NotImplementedError(
                "`seq_idx` support requires fast path support. Please install `mamba_ssm` and `causal_conv1d`"
            )

        dtype = hidden_states.dtype
        if attention_mask is not None and attention_mask.shape[1] > 1 and attention_mask.shape[0] > 1:
            # tune out hidden states for padding tokens, see https://github.com/state-spaces/mamba/issues/66
            hidden_states = (hidden_states * attention_mask[:, :, None]).to(dtype)

        return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)


class BambaMLP(LlamaMLP):
    pass


class BambaRMSNorm(LlamaRMSNorm):
    pass


class BambaDecoderLayer(JambaAttentionDecoderLayer):
    def __init__(self, config: BambaConfig, layer_idx: int, layer_type: str = "mamba"):
        super().__init__(config, layer_idx)

        del self.self_attn

        num_experts = 1
        ffn_layer_class = BambaMLP if num_experts == 1 else None
        self.feed_forward = ffn_layer_class(config)

        self.layer_type = layer_type
        if layer_type == "mamba":
            self.mamba = BambaMixer(config=config, layer_idx=layer_idx)
        elif layer_type == "attention":
            self.self_attn = BambaAttention(config, layer_idx)
        else:
            raise ValueError("Invalid layer_type")

    @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[HybridMambaAttentionDynamicCache] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        cache_position: Optional[torch.LongTensor] = None,
        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
        **kwargs: Unpack[BambaFlashAttentionKwargs],
    ) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]:
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, sequence_length)` where padding elements are indicated by 0.
            past_key_values (`HybridMambaAttentionDynamicCache`, *optional*): cached past key and value projection states
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
                Indices depicting the position of the input sequence tokens in the sequence.
            position_embeddings (`tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
                Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
                with `head_dim` being the embedding dimension of each attention head.
            kwargs (`dict`, *optional*):
                Arbitrary kwargs. Can be used to provide `BambaFlashAttentionKwargs` for
                padding-free training and/or to improve torch.compile performance.
        """
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # this is a hybrid decoder layer: route through either the mamba mixer or self-attention
        if self.layer_type == "mamba":
            hidden_states = self.mamba(
                hidden_states=hidden_states,
                cache_params=past_key_values,
                cache_position=cache_position,
                attention_mask=attention_mask,
                **kwargs,
            )
            self_attn_weights = None
        elif self.layer_type == "attention":
            hidden_states, self_attn_weights = self.self_attn(
                hidden_states=hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_values=past_key_values,
                output_attentions=output_attentions,
                use_cache=use_cache,
                cache_position=cache_position,
                position_embeddings=position_embeddings,
                **kwargs,
            )

        # residual connection after the token mixer
        hidden_states = residual + hidden_states

        # feed-forward block
        residual = hidden_states
        hidden_states = self.pre_ff_layernorm(hidden_states)
        hidden_states = self.feed_forward(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)
        if output_attentions:
            outputs += (self_attn_weights,)

        return outputs


@auto_docstring
class BambaPreTrainedModel(PreTrainedModel):
    config: BambaConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["BambaDecoderLayer"]
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn = True
    _supports_sdpa = True
    _is_stateful = True

    def _init_weights(self, module):
        super()._init_weights(module)
        if isinstance(module, BambaMixer):
            module.dt_bias.data.fill_(1.0)
            module.A_log.data = torch.log(torch.arange(1, module.num_heads + 1))
            module.D.data.fill_(1.0)


@auto_docstring
class BambaModel(BambaPreTrainedModel):
    def __init__(self, config: BambaConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)

        decoder_layers = []
        for i in range(config.num_hidden_layers):
            decoder_layers.append(BambaDecoderLayer(config, layer_idx=i, layer_type=config.layers_block_type[i]))
        self.layers = nn.ModuleList(decoder_layers)

        self._attn_implementation = config._attn_implementation
        self.final_layernorm = BambaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.rotary_emb = BambaRotaryEmbedding(config=config)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[HybridMambaAttentionDynamicCache] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[BambaFlashAttentionKwargs],
    ) -> BaseModelOutputWithPast:
        # Standard hybrid decoder loop: embed the tokens, build the causal attention mask
        # (`_update_causal_mask`) and the mamba mask (`_update_mamba_mask`), compute the rotary
        # position embeddings once, run every `BambaDecoderLayer` with the mask matching its
        # layer type, and finish with `final_layernorm`. Passing `use_cache=True` without an
        # initialized `HybridMambaAttentionDynamicCache` logs a warning and returns no cache.
        # The full body is not reproduced here; see `transformers/models/bamba/modeling_bamba.py`.
        ...

    def _update_causal_mask(self, attention_mask, input_tensor, cache_position, past_key_values, output_attentions):
        # Mirrors the Llama-style mask handling: return `None` for flash-attention-2 (or when SDPA
        # can safely ignore the mask), otherwise build a 4D causal mask via
        # `_prepare_4d_causal_attention_mask_with_cache_position`; for SDPA on cuda/xpu/npu,
        # fully masked rows are unmasked with `AttentionMaskConverter._unmask_unattended`.
        # The full body is not reproduced here; see `transformers/models/bamba/modeling_bamba.py`.
        ...

    @staticmethod
    def _prepare_4d_causal_attention_mask_with_cache_position(
        attention_mask: torch.Tensor,
        sequence_length: int,
        target_length: int,
        dtype: torch.dtype,
        cache_position: torch.Tensor,
        batch_size: int,
        **kwargs,
    ):
        """
        Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
        `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.

        Args:
            attention_mask (`torch.Tensor`):
                A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
                `(batch_size, 1, query_length, key_value_length)`.
            sequence_length (`int`):
                The sequence length being processed.
            target_length (`int`):
                The target length: when generating with static cache, the mask should be as long as the static cache,
                to account for the 0 padding, the part of the cache that is not filled yet.
            dtype (`torch.dtype`):
                The dtype to use for the 4D attention mask.
            cache_position (`torch.Tensor`):
                Indices depicting the position of the input sequence tokens in the sequence.
            batch_size (`torch.Tensor`):
                Batch size.
        """
        # Builds (or passes through) the 4D causal mask described above, expanding a 2D padding
        # mask into `(batch_size, 1, query_length, key_value_length)` with the dtype's minimum
        # value on masked positions.
        # The full body is not reproduced here; see `transformers/models/bamba/modeling_bamba.py`.
        ...

    def _update_mamba_mask(self, attention_mask, cache_position):
        """
        No need for zeroing states when
            1. Cached forward
            2. Attending to all inputs
        """
        mamba_mask = attention_mask
        if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
            mamba_mask = None
        return mamba_mask


class BambaForCausalLM(LlamaForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        self.z_loss_coefficient = config.z_loss_coefficient
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[HybridMambaAttentionDynamicCache] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
        logits_to_keep: Union[int, torch.Tensor] = 0,
        **kwargs,
    ) -> CausalLMOutputWithPast:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Example:

        ```python
        >>> from transformers import AutoTokenizer, BambaForCausalLM

        >>> model = BambaForCausalLM.from_pretrained("...")
        >>> tokenizer = AutoTokenizer.from_pretrained("...")

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```N)	rP  r~   rp   r$  rQ  r(  r'  rR  r   )logitsr}  rE  r   rf   rg   r   r:   )lossr  r$  r}   rT  r5   )r8   r'  rR  r5  rS  r7  r3   slicelm_headloss_functionrE  r|  	logsumexpr|   r=   powmeanr   r$  r}   rT  )r\   rP  r~   rp   r$  rQ  r}  r(  r'  rR  r   r~  r  r/  r}   slice_indicesr  r  z_lossr5   r5   r6   r  8  s@   %

 zBambaForCausalLM.forwardTc              	   K   s<  |d u }	|	s5|d us|d |j d kr"|d d |j d  d f }n!|j d |j d kr4|d d |f }nt| j|j d | j| jd}|d url|d u rl| dd }||dkd |	sl|d d |j d  d f }|d urw|	rwd|i}
nd| i}
|
	||||| jj
|d | D ]\}}||
vr||
|< q|
S )Nrf   r   r   r>   rQ  rP  )rp   r$  r(  r~   r~  r   )rj   r   r8   r=   r<   longr   masked_fill_r   updatenum_logits_to_keepitems)r\   rP  r$  r~   rQ  r   rp   r(  r  empty_past_kvmodel_inputskeyvaluer5   r5   r6   prepare_inputs_for_generation  sB   
z.BambaForCausalLM.prepare_inputs_for_generation)NNNNNNNNNNr   )NNNNNT)r,   r-   r.   r`   r   r0   r1   r  r   r3  r1  r   r3   r   r  r  r  r5   r5   r   r6   r{  0  sZ    		

Pr{  )rB  r{  r4  )Nr   )Hr/   typingr   r   r   r0   r   transformers.activationsr   (transformers.models.jamba.modeling_jambar   r   (transformers.models.llama.modeling_llamar	   r
   r   r   r   r   *transformers.models.mamba2.modeling_mamba2r   r   r   r   modeling_attn_mask_utilsr   modeling_outputsr   r   modeling_utilsr   processing_utilsr   utilsr   r   r   utils.deprecationr   utils.import_utilsr   r   configuration_bambar    +mamba_ssm.ops.triton.selective_state_updater!   !mamba_ssm.ops.triton.ssd_combinedr"   r#   causal_conv1dr$   r%   ry  r   
get_loggerr,   r   r&   rc   ry   rz   r{   r   Moduler   r  r  r  r4  rB  r{  __all__r5   r5   r5   r6   <module>   s^    
5
(   ba y 