o }o™iº<ã@sÆddlmZmZmZddlZddlmmZddlm Z mZddl mZddl mZddlmZmZGdd„deƒZGdd „d eƒZGd d„deƒZde d e fdd„Zde de fdd„Zdd„ZdS)é)ÚDictÚLiteralÚTupleN)ÚTensorÚnn)Ú all_gather)ÚMaskedTokenLossReductionÚMegatronLossReductionc szeZdZdZddedededdf‡fd d „ Zdeeej fdej de ej eeej fffd d„Zdej fdd„Z‡Z S)ÚBERTLossReductionzYBert Loss Function. when add_sop_loss = False, only calculate Masked token loss. FTÚvalidation_stepÚ val_drop_lastÚadd_sop_lossÚreturnNcs4tƒ ¡||_||_||_|st||ƒ|_dSdS)N)ÚsuperÚ__init__rrr rÚmlm)Úselfrrr ©Ú __class__©úR/home/ubuntu/.local/lib/python3.10/site-packages/nemo/collections/llm/bert/loss.pyrs þzBERTLossReduction.__init__ÚbatchÚforward_outcCs¦|d|d<|js|j ||d¡Sddlm}|d|d}}|dus)Jdƒ‚| ¡}|dkr@t||d ƒ}t||dƒ}ntd ƒ‚||} t | gƒ} | d| ifS)zqPerform Loss calculation on batch. Currently, Context parallelism is not supported for SOP loss. Ú loss_maskÚlm_lossr©Úparallel_stateÚ binary_logitsNz…Attempting to calculate Sentence Order Prediction Loss but SOP logits are not provideds, Please Make sure you have added binary head.éÚ is_randomz$CP is not supported for SOP loss yetÚavg) r rÚforwardÚ megatron.corerÚget_context_parallel_world_sizeÚsentence_order_prediction_lossÚmasked_token_with_zeroÚNotImplementedErrorÚ)average_losses_across_data_parallel_group)rrrrÚlm_loss_Ú sop_logitsÚcp_sizeÚsop_loss_for_ubÚlm_loss_for_ubÚloss_for_ubÚreduced_lossrrrr!(s ÿ zBERTLossReduction.forwardcCóÀ|rVd|dvrdd„|Dƒ}t |¡ ¡}|Sddlm}dd„|Dƒ}t|ƒdkr4t |¡jddntjddgtj ¡d }tjj||j d dd|d|d }|Stjdtj ¡d S)úTaken from: https://github.com/NVIDIA/NeMo/blob/main /nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py#L535-L552 .r rcSóg|]}|d‘qS©r r©Ú.0ÚxrrrÚ Oóz,BERTLossReduction.reduce..rcSó$g|]}|dddkr|d‘qS©Úloss_sum_and_ub_sizerrrr3rrrr6Uó©Údimç©ÚdeviceT©Úwith_context_parallel©Úgroupr©ÚtorchÚcatÚmeanr"rÚlenÚvstackÚsumÚtensorÚcudaÚcurrent_deviceÚdistributedÚ all_reduceÚget_data_parallel_group©rÚlosses_reduced_per_micro_batchr Úlossrr:rrrÚreduceIó(ÿÿý þzBERTLossReduction.reduce)FTT)Ú__name__Ú __module__Ú__qualname__Ú__doc__ÚboolrrÚstrrFrrr!rUÚ __classcell__rrrrr s ÿÿ þ!r c sŒeZdZdZ ddeded ed ededd f‡fdd„ Zdee e jfde jdee jee e jfffdd„Z de jfdd„Z‡ZS)ÚHardNegativeRankingLossaE This loss uses hard-negative samples. The difference of this loss to the default MultipleNegativesRankingLoss from Sentence Transformers is that the latter shares the hard negatives as negatives for all examples, whereas this loss uses hard negatives exclusively for the example they are associated. FTré2r>rrÚnum_hard_negativesÚscaleÚlabel_smoothingrNcs4tƒ ¡||_||_||_||_tj|d|_dS©N)rb) rrrrr`rarÚCrossEntropyLossÚcross_entropy_loss)rrrr`rarbrrrrqs z HardNegativeRankingLoss.__init__rrcCsvddlm}| ¡}|dkrtd|j›dƒ‚d|j}d|j}|jd|}| |¡}t dd„|Dƒ¡} t d d„|Dƒ¡} | jd| jddks[Jd | jd| jd¡ƒ‚| jd| jd|ksvJd | jd| jd|¡ƒ‚| j}| dd|¡ |d||d¡}tj|| dd |d|¡} tj|dtj| jd}| |j9} | | |¡}t|gƒ}|d|ifS)NrrrúCP is not supported for ú yet.écSr1©rr©r4Úitemrrrr6r7z3HardNegativeRankingLoss.forward..cSsg|]}|dd…‘qS)rNrrjrrrr6óz{} % {} > 0z {} / {} != {}éÿÿÿÿr<©Údtyper@r )r"rr#r&rr`ÚshapeÚchunkrFÚstackrGÚformatÚrepeatÚreshaperKÚzerosÚlongr@rarer')rrrrr*Únum_tensors_per_exampleÚcurrent_train_n_passagesÚ batch_sizeÚchunksÚqueryÚkeyÚquery_shapeÚrepeated_queryÚscoresÚlabelsÚce_lossr.rrrr!€s0 4ÿÿ zHardNegativeRankingLoss.forwardcCr/)r0r rcSr1r2rr3rrrr6§r7z2HardNegativeRankingLoss.reduce..rcSr8r9rr3rrrr6r;r<r>r?TrArCrrErRrrrrU¡rVzHardNegativeRankingLoss.reduce)FTrr_r>)rWrXrYrZr[ÚintÚfloatrrr\rFrrr!rUr]rrrrr^hs8 úþýüûúùÿÿ þ!r^cs¤eZdZdZ dded ed ededed ededddf‡fdd„ Zdd„Z de eej fdej deej e eej fffdd„Zdej fdd„Z‡ZS)Ú,BERTInBatchExclusiveHardNegativesRankingLossa¼ This loss uses in-batch negative samples + hard-negative samples. The difference of this loss to the default MultipleNegativesRankingLoss from Sentence Transformers is that the latter shares the hard negatives as negatives for all examples, whereas this loss uses hard negatives exclusively for the example they are associated. This loss is also capable of using in-batch negatives from all ranks during training. FTrér>Úlocalrrr`rarbÚglobal_in_batch_negativesÚ backprop_type)r‡ÚglobalrNcs@tƒ ¡||_||_||_||_tj|d|_||_ ||_ dSrc)rrrrr`rarrdrerˆr‰)rrrr`rarbrˆr‰rrrrËs z5BERTInBatchExclusiveHardNegativesRankingLoss.__init__cs‚ddlm}ˆ ¡‰|jdkr4‡fdd„t| ¡ƒDƒ}t|ˆ| ¡dˆ|| ¡<t j |dd}|Stˆƒ}t j |dd}|S)Nrrr‡csg|]}t ˆ¡‘qSr)rFÚ zeros_like)r4Ú_©Úlocal_tensorrrr6ãs ÿzhBERTInBatchExclusiveHardNegativesRankingLoss._gather_global_in_batch_representations..rCr<)r"rÚ contiguousr‰ÚrangeÚget_data_parallel_world_sizeÚall_gather_no_backproprQÚget_data_parallel_rankrFrGÚall_gather_with_backprop)rrŽrÚglobal_tensorsrrrÚ'_gather_global_in_batch_representationsÞs ÿýzTBERTInBatchExclusiveHardNegativesRankingLoss._gather_global_in_batch_representationsrrcsPddlm}| ¡}|dkrtd|j›dƒ‚|jr"|js"| |¡}d|j}|j d|}| |¡‰t dd„ˆDƒ¡}t d d„ˆDƒ¡}‡fd d„t |jƒDƒ} t || dd¡¡} t | d¡ t| ƒdd¡t | ¡¡jddj}tj| |gdd}| d d¡}||j9}tjt t|ƒƒtj|jd} | || ¡}t|gƒ}|d|ifS)NrrrrfrgrhcSr1rirrjrrrr6r7zHBERTInBatchExclusiveHardNegativesRankingLoss.forward..cSr1©rrrjrrrr6r7cs$g|]‰t ‡fdd„ˆDƒ¡‘qS)csg|]}|ˆd‘qS)rhrrj©Úirrr6rlzSBERTInBatchExclusiveHardNegativesRankingLoss.forward...)rFrr)r4©r{r˜rr6sÿrm)Úaxisgð¿gð?rnr )r"rr#r&rrˆrr–r`rprqrFrrrÚmmÚ transposeÚmultiplyÚ unsqueezertrIrKÚTrGÚclamprarLrwr@rer')rrrrr*rxrzÚqueriesÚ positivesÚ hard_negsÚpos_in_batch_negs_scoresÚhard_negs_scoresr€rr‚r.rršrr!ðsD ÿÿþüú ÿ z4BERTInBatchExclusiveHardNegativesRankingLoss.forwardcCr/)r0r rcSr1r2rr3rrrr6#r7zGBERTInBatchExclusiveHardNegativesRankingLoss.reduce..rcSr8r9rr3rrrr6)r;r<r>r?TrArCrrErRrrrrUrVz3BERTInBatchExclusiveHardNegativesRankingLoss.reduce)FTrr†r>Fr‡)rWrXrYrZr[rƒr„rrr–rr\rFrrr!rUr]rrrrr…ÀsFøþýüûúùø ÷ÿÿ þ-r…rLÚmaskcCsZ| ¡}| ¡}| ¡dkrt | d¡¡d}|St | d¡| d¡¡| ¡}|S)aSCalculate masked token loss with consideration of possible NaN. Sometimes when the number of tokens is very small, none of the tokens get masked for prediction. In that case loss mask is all zeros i.e Happens when the entire batch is masked out (Practically when MBS=1 or 2, and the number of tokens in each batch is < 7 ) rrmr>)r„rKrFÚviewru)rLr§ÚlossesrrTrrrr%<s"þr%Úsentence_ordercCs.| dd¡ ¡}| d¡}tj||dd}|S)z)Calculate sentence order prediction loss.rmrh)Úignore_index)r¨r„ÚFÚ cross_entropy)rLrªr©rTrrrr$Ls r$cCsNddlm}t dd„|Dƒ¡}tjj|| ¡d|tjj| ¡d}|S)z*Reduce a tensor of losses across all GPUs.rrcSsg|]}| ¡ ¡ d¡‘qSr—)ÚcloneÚdetachr¨)r4rTrrrr6Ysz=average_losses_across_data_parallel_group..rC)r"rrFrGrOrPrQÚget_world_size)r©rÚaveraged_lossesrrrr'Usÿr')ÚtypingrrrrFÚtorch.nn.functionalrÚ functionalr¬rÚtorch.distributedrr’Útorch.distributed.nn.functionalr”Ú nemo.lightning.megatron_parallelrr r r^r…r%r$r'rrrrÚsNX|