Loading Orpheus Hindi 3B...
INFO 03-14 21:43:37 [model.py:541] Resolved architecture: LlamaForCausalLM
INFO 03-14 21:43:37 [model.py:1882] Downcasting torch.float32 to torch.bfloat16.
INFO 03-14 21:43:37 [model.py:1561] Using max model len 131072
INFO 03-14 21:43:37 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-14 21:43:37 [vllm.py:624] Asynchronous scheduling is enabled.
WARNING 03-14 21:43:37 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
INFO 03-14 21:43:37 [vllm.py:762] Cudagraph is disabled under eager mode
WARNING 03-14 21:43:39 [system_utils.py:140] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=1955796) INFO 03-14 21:43:44 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='canopylabs/3b-hi-ft-research_release', speculative_config=None, tokenizer='canopylabs/3b-hi-ft-research_release', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=canopylabs/3b-hi-ft-research_release, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': , 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': , 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': , 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1955796) INFO 03-14 21:43:47 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://216.81.248.184:57725 backend=nccl
(EngineCore_DP0 pid=1955796) INFO 03-14 21:43:47 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     super().__init__(
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self._init_executor()
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.driver_worker.init_device()
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 235, in init_device
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 260, in request_memory
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     raise ValueError(
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946] ValueError: Free memory on device cuda:0 (6.36/79.25 GiB) on startup is less than desired GPU memory utilization (0.9, 71.33 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore_DP0 pid=1955796) Process EngineCore_DP0:
(EngineCore_DP0 pid=1955796) Traceback (most recent call last):
(EngineCore_DP0 pid=1955796)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1955796)     self.run()
(EngineCore_DP0 pid=1955796)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1955796)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 950, in run_engine_core
(EngineCore_DP0 pid=1955796)     raise e
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=1955796)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=1955796)     super().__init__(
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=1955796)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=1955796)     self._init_executor()
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=1955796)     self.driver_worker.init_device()
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=1955796)     self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 235, in init_device
(EngineCore_DP0 pid=1955796)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 260, in request_memory
(EngineCore_DP0 pid=1955796)     raise ValueError(
(EngineCore_DP0 pid=1955796) ValueError: Free memory on device cuda:0 (6.36/79.25 GiB) on startup is less than desired GPU memory utilization (0.9, 71.33 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W314 21:43:48.991147830 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "", line 6, in
  File "/home/ubuntu/.local/lib/python3.10/site-packages/orpheus_tts/engine_class.py", line 13, in __init__
    self.engine = self._setup_engine()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/orpheus_tts/engine_class.py", line 46, in _setup_engine
    return AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 257, in from_engine_args
    return cls(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 155, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
    return AsyncMPClient(*client_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 819, in __init__
    super().__init__(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 933, in launch_core_engines
    wait_for_engine_startup(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 992, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
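
The root cause in the log above is the ValueError: vLLM asked for 0.9 of the 79.25 GiB device (71.33 GiB), but only 6.36 GiB was free at startup. The arithmetic behind that check can be sketched as below. This is a minimal illustration, not vLLM's own code: the function names and the 0.5 GiB headroom are assumptions made for the example. On a real machine the free and total values could be read with torch.cuda.mem_get_info(), which returns bytes.

```python
import math

def requested_gib(total_gib: float, utilization: float) -> float:
    """Memory the engine would try to reserve for a given utilization fraction."""
    return total_gib * utilization

def fits(free_gib: float, total_gib: float, utilization: float) -> bool:
    """True if the requested reservation is no larger than the free memory."""
    return requested_gib(total_gib, utilization) <= free_gib

def max_fitting_utilization(free_gib: float, total_gib: float,
                            headroom_gib: float = 0.5) -> float:
    """Largest utilization (rounded down to 2 decimals) that fits in free memory.

    headroom_gib is a hypothetical safety margin, not a vLLM setting.
    """
    usable = max(free_gib - headroom_gib, 0.0)
    return math.floor(usable / total_gib * 100) / 100

# Values taken from the ValueError in the log.
free, total, util = 6.36, 79.25, 0.9
print(f"requested: {requested_gib(total, util):.2f} GiB, free: {free} GiB")
print("fits:", fits(free, total, util))
print("largest fitting utilization:", max_fitting_utilization(free, total))
```

The fraction would normally be passed through vLLM's gpu_memory_utilization engine argument; whether the orpheus_tts wrapper forwards it is not shown in the log and should be checked. With only about 6 GiB free, a lower fraction may still leave too little room for a 3B bfloat16 model and its KV cache, so freeing the memory held by other processes on cuda:0 (the other option the error message names) is probably the more practical fix.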