o ÔÙ¾iQã@s~dZddlmZmZddlmZddlmZddlm Z m Z Gdd„deƒZz e de¡Wd Se y>eejd<Yd Sw) a%LFM2-MoE (Liquid Foundation Model 2 - Mixture of Experts) configuration Note: HF transformers has Lfm2MoeConfig in v5.0.0rc2 (unreleased). Once released, we could inherit from it like Lfm2Config does with HFLfm2Config. For now, we define a standalone config to support the model immediately. é)ÚListÚOptional)ÚCONFIG_MAPPING)ÚPretrainedConfig)ÚMamba2CacheParamsÚMamba2StateShapec5s$eZdZdZdZdgZ d:dedededededededededed ed!ed"ed#ed$ed%e e d&ed'ed(ed)ed*ed+ed,ed-ed.e eef2‡fd/d0„ Z ed1eefd2d3„ƒZed1eefd4d5„ƒZed1efd6d7„ƒZed1e efd8d9„ƒZ‡ZS);Ú Lfm2MoeConfigaÀ Configuration for LFM2-MoE models (e.g., LiquidAI/LFM2-8B-A1B). LFM2-MoE is a hybrid architecture with: - Attention layers and ShortConv layers (like dense LFM2) - MoE (Mixture of Experts) FFN layers with sigmoid routing Key MoE specifics: - First `num_dense_layers` use dense MLP, rest use MoE - Sigmoid routing (not softmax) with expert_bias for load balancing - expert_bias is fp32 for numerical stability Úlfm2_moeÚpast_key_valuesééééé ééôç{®Gáz”?çñhãˆµøä>TrééNFééçð?Ú vocab_sizeÚhidden_sizeÚintermediate_sizeÚmoe_intermediate_sizeÚnum_hidden_layersÚnum_attention_headsÚnum_key_value_headsÚmax_position_embeddingsÚinitializer_rangeÚnorm_epsÚ use_cacheÚpad_token_idÚbos_token_idÚeos_token_idÚtie_word_embeddingsÚrope_parametersÚ conv_biasÚconv_L_cacheÚnum_dense_layersÚnum_expertsÚnum_experts_per_tokÚuse_expert_biasÚrouted_scaling_factorÚnorm_topk_probÚlayer_typescsØ||_||_||_||_||_||_||_||_| |_| |_ ||_ ||_||_||_ ||_||_||_||_||_||_||_|durVt|ƒ|krVtdt|ƒ›d|›dƒ‚| d|¡}tƒjd|| ||dœ|¤ŽdS)Nzlayer_types length (z ) must match num_hidden_layers (ú)Ú tie_embedding)r$r%r&r'©)rrrrrrrr r!r"r#r)r*r+r,r-r.r/r0r1r(ÚlenÚ ValueErrorÚpopÚsuperÚ__init__)Úselfrrrrrrrr r!r"r#r$r%r&r'r(r)r*r+r,r-r.r/r0r1Úkwargs©Ú __class__r4úO/home/ubuntu/.local/lib/python3.10/site-packages/sglang/srt/configs/lfm2_moe.pyr9-sHÿÿü ûzLfm2MoeConfig.__init__ÚreturncCó"|jdurgSdd„t|jƒDƒS)z0Return indices of attention layers for KV cache.NcSsg|] \}}|dkr|‘qS)Úfull_attentionr4©Ú.0ÚiÚltr4r4r>Ú sz:Lfm2MoeConfig.full_attention_layer_ids..©r1Ú enumerate©r:r4r4r>Úfull_attention_layer_ids|s z&Lfm2MoeConfig.full_attention_layer_idscCr@)z3Return indices of conv layers for conv state cache.NcSsg|] \}}|dvr|‘qS))ÚconvÚ short_convr4rBr4r4r>rFˆsz2Lfm2MoeConfig.linear_layer_ids..rGrIr4r4r>Úlinear_layer_idsƒs ÿzLfm2MoeConfig.linear_layer_idscCsdS)z@Return chunk size for Mamba2 backend. LFM2 doesn't use chunking.rr4rIr4r4r>Úmamba_chunk_sizeŒszLfm2MoeConfig.mamba_chunk_sizec Cstddlm}|j}|s dS|j}t|jƒ}z|ƒ}Wn ttfy'd}Ynwtj ||d||d|d}t ||dS)z’ Get cache params for HybridReqToTokenPool initialization. LFM2-MoE uses ShortConv layers with a small fixed-size cache. r)Úget_attention_tp_sizeNr)Ú tp_world_sizerÚn_groupsÚ num_headsÚhead_dimÚ state_sizeÚconv_kernel)ÚshapeÚlayers)Úsglang.srt.layers.dp_attentionrOrMrÚintr*ÚAssertionErrorÚRuntimeErrorrÚcreater)r:rOÚconv_layer_idsrrUÚtp_sizerVr4r4r>Úmamba2_cache_params‘s0 ÿùþz!Lfm2MoeConfig.mamba2_cache_params)rrr rrrrrrrTrrrTNFrrrrTrTN)Ú__name__Ú __module__Ú__qualname__Ú__doc__Ú model_typeÚkeys_to_ignore_at_inferencerYÚfloatÚboolrÚdictrÚstrr9ÚpropertyrJrMrNrr_Ú __classcell__r4r4r<r>rs° äþýüûúùø ÷ öõô óòñðïîíëêéèçæ äOrr N)rcÚtypingrrÚtransformersrÚ transformers.configuration_utilsrÚsglang.srt.configs.mamba_utilsrrrÚregisterÚ ExceptionÚ_extra_contentr4r4r4r>Ús!þ