Loading Orpheus Hindi 3B...
INFO 03-14 21:43:37 [model.py:541] Resolved architecture: LlamaForCausalLM
INFO 03-14 21:43:37 [model.py:1882] Downcasting torch.float32 to torch.bfloat16.
INFO 03-14 21:43:37 [model.py:1561] Using max model len 131072
INFO 03-14 21:43:37 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-14 21:43:37 [vllm.py:624] Asynchronous scheduling is enabled.
WARNING 03-14 21:43:37 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
INFO 03-14 21:43:37 [vllm.py:762] Cudagraph is disabled under eager mode
WARNING 03-14 21:43:39 [system_utils.py:140] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=1955796) INFO 03-14 21:43:44 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='canopylabs/3b-hi-ft-research_release', speculative_config=None, tokenizer='canopylabs/3b-hi-ft-research_release', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=canopylabs/3b-hi-ft-research_release, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1955796) INFO 03-14 21:43:47 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://216.81.248.184:57725 backend=nccl
(EngineCore_DP0 pid=1955796) INFO 03-14 21:43:47 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     super().__init__(
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self._init_executor()
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.driver_worker.init_device()
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 235, in init_device
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 260, in request_memory
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946]     raise ValueError(
(EngineCore_DP0 pid=1955796) ERROR 03-14 21:43:47 [core.py:946] ValueError: Free memory on device cuda:0 (6.36/79.25 GiB) on startup is less than desired GPU memory utilization (0.9, 71.33 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore_DP0 pid=1955796) Process EngineCore_DP0:
(EngineCore_DP0 pid=1955796) Traceback (most recent call last):
(EngineCore_DP0 pid=1955796)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1955796)     self.run()
(EngineCore_DP0 pid=1955796)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1955796)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 950, in run_engine_core
(EngineCore_DP0 pid=1955796)     raise e
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=1955796)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=1955796)     super().__init__(
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=1955796)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=1955796)     self._init_executor()
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=1955796)     self.driver_worker.init_device()
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=1955796)     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 235, in init_device
(EngineCore_DP0 pid=1955796)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=1955796)   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 260, in request_memory
(EngineCore_DP0 pid=1955796)     raise ValueError(
(EngineCore_DP0 pid=1955796) ValueError: Free memory on device cuda:0 (6.36/79.25 GiB) on startup is less than desired GPU memory utilization (0.9, 71.33 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W314 21:43:48.991147830 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<string>", line 6, in <module>
  File "/home/ubuntu/.local/lib/python3.10/site-packages/orpheus_tts/engine_class.py", line 13, in __init__
    self.engine = self._setup_engine()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/orpheus_tts/engine_class.py", line 46, in _setup_engine
    return AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 257, in from_engine_args
    return cls(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 155, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
    return AsyncMPClient(*client_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 819, in __init__
    super().__init__(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 933, in launch_core_engines
    wait_for_engine_startup(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 992, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
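
Note on the root cause: the ValueError above reports 6.36 GiB free out of 79.25 GiB on cuda:0, while the engine asked for 0.9 of the device (71.33 GiB). The sketch below only reproduces that arithmetic and estimates the largest utilization fraction that would pass the startup check. The function names and the 0.5 GiB headroom are hypothetical and not part of vLLM or orpheus_tts. `gpu_memory_utilization` is a real vLLM engine argument, but whether the orpheus_tts wrapper exposes it is an assumption to verify.

```python
# Minimal sketch of the memory arithmetic behind the ValueError in this log.
# Helper names and the headroom value are illustrative assumptions.

def requested_gib(total_gib: float, utilization: float) -> float:
    """Memory the engine tries to reserve: a fraction of total device memory."""
    return total_gib * utilization


def max_feasible_utilization(free_gib: float, total_gib: float,
                             headroom_gib: float = 0.5) -> float:
    """Largest utilization fraction that fits in currently free memory,
    leaving a small (assumed) safety margin."""
    usable = max(free_gib - headroom_gib, 0.0)
    return usable / total_gib


# Values taken from the error message in this log.
TOTAL_GIB = 79.25
FREE_GIB = 6.36

wanted = requested_gib(TOTAL_GIB, 0.9)
feasible = max_feasible_utilization(FREE_GIB, TOTAL_GIB)

print(f"requested at 0.9 utilization: {wanted:.3f} GiB")
print(f"free at startup:              {FREE_GIB:.2f} GiB")
print(f"largest feasible fraction:    {feasible:.3f}")
```

Lowering the fraction alone is unlikely to be enough here: a 3B-parameter model in bfloat16 needs roughly 6 GB for its weights (2 bytes per parameter), which leaves almost nothing for the KV cache within 6.36 GiB. Freeing the memory held by other processes on the GPU (visible with `nvidia-smi`) is the more likely fix.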