import logging
from typing import Callable, List, Optional, Union

import torch
from torch import nn

from torchtune.utils._import_guard import _SUPPORTS_FLEX_ATTENTION
from torchtune.utils._logging import get_logger, log_once

_log: logging.Logger = get_logger()

if _SUPPORTS_FLEX_ATTENTION:
    from torch.nn.attention.flex_attention import (
        BlockMask,
        create_block_mask as create_block_causal_mask_flex,
        flex_attention,
    )

    def compile_flex_attention():
        # Try the default compile first, then fall back to max-autotune.
        try:
            return torch.compile(flex_attention, dynamic=False)
        except Exception as e:
            _log.info(
                f"Compiling flex_attention failed with error '{e}'. "
                "Retrying with mode='max-autotune'."
            )
            try:
                return torch.compile(
                    flex_attention, dynamic=False, mode="max-autotune"
                )
            except Exception as e:
                _log.info(
                    f"Compiling flex_attention failed with error: '{e}'. "
                    "Updating your PyTorch version to a nightly build may solve it, "
                    "or you can set dataset.packed=False in your config to avoid "
                    "using flex attention."
                )
                raise

    flex_attention_compiled = compile_flex_attention()

    # Keep the outer compiler from tracing into this wrapper, so the
    # pre-compiled flex attention kernel is the one that is called.
    @torch.compiler.disable(recursive=False)
    def compile_friendly_flex_attention(
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        block_mask: BlockMask,
    ) -> torch.Tensor:
        return flex_attention_compiled(q, k, v, block_mask=block_mask)

    _MaskType = Union[torch.Tensor, BlockMask]
else:
    _MaskType = torch.Tensor


def _get_document_ids_from_seq_lens(
    seq_lens: List[torch.Tensor],
) -> torch.Tensor:
    """
    Convert a batch of sequence lengths into integer IDs denoting sample ownership.
    For example, seq_lens = [2, 3, 1] would return [0, 0, 1, 1, 1, 2].

    Args:
        seq_lens (List[torch.Tensor]): Sequence lengths of samples in each pack in the batch,
            shape (batch_size, n), where n is the max number of sequences in a pack and can vary
            across packs.

    Returns:
        Tensor: Document IDs of shape (batch_size, max_seq_len).
    """
    batch_size = len(seq_lens)
    batch_document_ids = []
    for sample_idx in range(batch_size):
        # One run of repeated IDs per sample: i repeated seq_len times.
        document_ids = torch.cat(
            [
                torch.full((seq_len,), i, dtype=torch.long, device=seq_len.device)
                for i, seq_len in enumerate(seq_lens[sample_idx])
            ]
        )
        batch_document_ids.append(document_ids)
    batch_document_ids = torch.stack(batch_document_ids)
    return batch_document_ids


def create_block_causal_mask(seq_lens: List[torch.Tensor]) -> torch.Tensor:
    """
    Given a batch of sequence lengths defining the lengths of samples in each pack,
    construct a 2D block causal mask for each pack in the batch. For example, if
    a single sample's seq_lens is [3, 2, 1], the mask would be::

        mask = [
            [1, 0, 0, 0, 0, 0],
            [1, 1, 0, 0, 0, 0],
            [1, 1, 1, 0, 0, 0],
            [0, 0, 0, 1, 0, 0],
            [0, 0, 0, 1, 1, 0],
            [0, 0, 0, 0, 0, 1],
        ]

    Args:
        seq_lens (List[torch.Tensor]): Sequence lengths of samples in each pack in the batch,
            shape (batch_size, n), where n is the max number of sequences in a pack and can vary
            across packs.

    Returns:
        Tensor: Block causal mask of shape (batch_size, max_seq_len, max_seq_len).
    """
    batch_block_attn_masks = []
    batch_size = len(seq_lens)
    for sample_idx in range(batch_size):
        # One lower-triangular (causal) block per sample in the pack.
        block_attn_masks = [
            torch.tril(
                torch.ones(seq_len, seq_len, dtype=torch.bool, device=seq_len.device)
            )
            for i, seq_len in enumerate(seq_lens[sample_idx])
        ]
        batch_block_attn_masks.append(torch.block_diag(*block_attn_masks))
    return torch.stack(batch_block_attn_masks)


def packed_block_causal_mask(
    seq_lens: List[torch.Tensor],
) -> _MaskType:
    """
    Create a block causal document mask for a batch of packed sequences. If
    flex attention is supported by the current hardware, the block causal logic
    is defined as a mask function and passed into
    :func:`torch.nn.attention.flex_attention.create_block_mask`. The resultant
    BlockMask is a compressed representation of the full block causal mask.
    On an older version, a standard 2D block causal mask is created and returned.

    Args:
        seq_lens (List[torch.Tensor]): Sequence lengths of samples in each pack in the batch,
            shape (batch_size, n), where n is the max number of sequences in a pack and can vary
            across packs.

    Returns:
        _MaskType: BlockMask, or Tensor if torch version < 2.5.0.
    """
    if _SUPPORTS_FLEX_ATTENTION:
        document_ids = _get_document_ids_from_seq_lens(seq_lens)
        batch_size, max_seq_len = document_ids.shape
        document_ids = document_ids.to("cuda")

        def mask_mod(b, h, q_idx, kv_idx):
            """
            Defines the logic of a block causal mask by combining both a standard
            causal mask and a block diagonal document mask.

            See :func:`~torchtune.modules.attention_utils.create_block_causal_mask`
            for an illustration.
            """
            causal_mask = q_idx >= kv_idx
            document_mask = document_ids[b, q_idx] == document_ids[b, kv_idx]
            return causal_mask & document_mask

        return create_block_causal_mask_flex(
            mask_mod,
            batch_size,
            None,
            max_seq_len,
            max_seq_len,
            device="cuda",
        )
    else:
        return create_block_causal_mask(seq_lens=seq_lens)


def _sdpa_or_flex_attention() -> Callable:
    """
    Helper function to decide when to call flex attention or SDPA. It will use
    flex attention if ALL of the following conditions are met, otherwise it will
    default to SDPA:
    - torch version >= 2.5.0
    - we are sample packing, therefore mask is a BlockMask
    - torch.cuda.get_device_capability() >= (7, 5)
    """
    if _SUPPORTS_FLEX_ATTENTION:

        def _attention_call(
            q: torch.Tensor,
            k: torch.Tensor,
            v: torch.Tensor,
            mask: Optional[_MaskType],
            dropout_p: float,
            is_causal: bool,
        ) -> torch.Tensor:
            # A BlockMask means sample packing is in use: route to flex attention.
            if isinstance(mask, BlockMask):
                log_once(
                    _log,
                    "Using flex attention for attention computation since a BlockMask was passed in.",
                    level=logging.DEBUG,
                )
                if dropout_p > 0.0:
                    raise ValueError(
                        "Flex attention does not support dropout. Please set dropout to 0.0."
                    )
                return compile_friendly_flex_attention(
                    q,
                    k,
                    v,
                    block_mask=mask,
                )
            # Otherwise fall back to SDPA with a regular tensor mask.
            else:
                # Add a head dimension so the mask broadcasts across heads.
                if mask is not None:
                    mask = mask[:, None, :, :]

                return nn.functional.scaled_dot_product_attention(
                    q,
                    k,
                    v,
                    attn_mask=mask,
                    dropout_p=dropout_p,
                    is_causal=is_causal,
                )

    else:

        def _attention_call(
            q: torch.Tensor,
            k: torch.Tensor,
            v: torch.Tensor,
            mask: Optional[_MaskType],
            dropout_p: float,
            is_causal: bool,
        ) -> torch.Tensor:
            # Add a head dimension so the mask broadcasts across heads.
            if mask is not None:
                mask = mask[:, None, :, :]

            return nn.functional.scaled_dot_product_attention(
                q,
                k,
                v,
                attn_mask=mask,
                dropout_p=dropout_p,
                is_causal=is_causal,
            )

    return _attention_call