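from typing import List, Union

from modelscope.pipelines.accelerate.base import InferFramework
from modelscope.utils.import_utils import is_vllm_available


class Vllm(InferFramework):

    def __init__(self,
                 model_id_or_dir: str,
                 dtype: str = 'auto',
                 quantization: str = None,
                 tensor_parallel_size: int = 1):
        """
        Args:
            dtype: The dtype to use; supports `auto`, `float16`, `bfloat16`, `float32`.
            quantization: The quantization bit; the default None means no quantization.
            tensor_parallel_size: The tensor parallel size.
        """
        super().__init__(model_id_or_dir)
        if not is_vllm_available():
            raise ImportError(
                'Install vllm by `pip install vllm` before using vllm to accelerate inference'
            )
        # Import lazily so the module can be loaded without vllm installed.
        from vllm import LLM
        # Fall back to float16 when the GPU fails the compatibility check
        # and bfloat16 would otherwise be requested or auto-selected.
        if not Vllm.check_gpu_compatibility(8) and dtype in ('bfloat16',
                                                             'auto'):
            dtype = 'float16'
        self.model = LLM(
            self.model_dir,
            dtype=dtype,
            quantization=quantization,
            trust_remote_code=True,
            tensor_parallel_size=tensor_parallel_size)

    def __call__(self, prompts: Union[List[str], List[List[int]]],
                 **kwargs) -> List[str]:
        """Generate tokens.

        Args:
            prompts(`Union[List[str], List[List[int]]]`):
                The string batch or the token list batch to input to the model.
            kwargs: Sampling parameters.
        """
        # Translate Hugging Face style generation arguments into vllm ones.
        do_sample = kwargs.pop('do_sample', None)
        num_beam = kwargs.pop('num_beam', 1)
        max_length = kwargs.pop('max_length', None)
        max_new_tokens = kwargs.pop('max_new_tokens', None)

        # Beam search applies only without sampling and with several beams.
        if not do_sample and num_beam > 1:
            kwargs['use_beam_search'] = True
        # max_length counts the prompt, so subtract the first prompt's length.
        if max_length:
            kwargs['max_tokens'] = max_length - len(prompts[0])
        # max_new_tokens takes precedence over max_length when both are given.
        if max_new_tokens:
            kwargs['max_tokens'] = max_new_tokens

        from vllm import SamplingParams
        sampling_params = SamplingParams(**kwargs)
        if isinstance(prompts[0], str):
            # Text prompts are passed positionally.
            return [
                output.outputs[0].text for output in self.model.generate(
                    prompts, sampling_params=sampling_params)
            ]
        # Pre-tokenized prompts are passed as token id lists.
        return [
            output.outputs[0].text for output in self.model.generate(
                prompt_token_ids=prompts, sampling_params=sampling_params)
        ]

    def model_type_supported(self, model_type: str):
        # A model type is supported if it contains any known family name.
        return any([
            model in model_type.lower() for model in [
                'llama', 'baichuan', 'internlm', 'mistral', 'aquila', 'bloom',
                'falcon', 'gpt', 'mpt', 'opt', 'qwen'
            ]
        ])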