WARNING 03-16 10:51:56 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
Starting vLLM-Omni Qwen3-TTS...
INFO 03-16 10:51:56 [weight_utils.py:50] Using model weights format ['*']
INFO 03-16 10:51:57 [omni.py:181] Initializing stages for model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
INFO 03-16 10:51:57 [omni.py:313] No omni_master_address provided, defaulting to localhost (127.0.0.1)
INFO 03-16 10:51:57 [initialization.py:270] Loaded OmniTransferConfig with 1 connector configurations
INFO 03-16 10:51:57 [factory.py:46] Created connector: SharedMemoryConnector
INFO 03-16 10:51:57 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
INFO 03-16 10:51:57 [omni.py:347] [Orchestrator] Loaded 2 stages
INFO 03-16 10:51:57 [omni.py:458] [Orchestrator] Waiting for 2 stages to initialize (timeout: 300s)
[Stage-1] WARNING 03-16 10:52:04 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-1] INFO 03-16 10:52:04 [omni_stage.py:679] Starting stage worker with model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
[Stage-1] INFO 03-16 10:52:04 [omni_stage.py:694] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-1] INFO 03-16 10:52:04 [omni_stage.py:725] [Stage-1] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-1] INFO 03-16 10:52:04 [omni_stage.py:79] NVML process-scoped memory available and PID host is available — concurrent init is safe, skipping locks
[Stage-0] WARNING 03-16 10:52:04 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:04] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 03-16 10:52:04 [omni_stage.py:679] Starting stage worker with model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
[Stage-0] INFO 03-16 10:52:04 [omni_stage.py:694] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-0] INFO 03-16 10:52:04 [omni_stage.py:725] [Stage-0] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-0] INFO 03-16 10:52:04 [omni_stage.py:79] NVML process-scoped memory available and PID host is available — concurrent init is safe, skipping locks
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:04] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:04] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:05] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-1] INFO 03-16 10:52:05 [initialization.py:270] Loaded OmniTransferConfig with 1 connector configurations
[Stage-1] INFO 03-16 10:52:05 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 03-16 10:52:05 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:05] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:05] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 03-16 10:52:05 [initialization.py:270] Loaded OmniTransferConfig with 1 connector configurations
[Stage-0] INFO 03-16 10:52:05 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-0] INFO 03-16 10:52:05 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:05] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-1] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-1] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-1] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-1] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2026-03-16 10:52:05] WARNING configuration_utils.py:697: The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-0] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 03-16 10:52:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-1] INFO 03-16 10:52:15 [model.py:529] Resolved architecture: Qwen3TTSCode2Wav
[Stage-0] INFO 03-16 10:52:15 [model.py:529] Resolved architecture: Qwen3TTSTalkerForConditionalGeneration
[Stage-1] INFO 03-16 10:52:15 [model.py:1549] Using max model len 32768
[Stage-1] INFO 03-16 10:52:15 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=8192.
[Stage-1] INFO 03-16 10:52:15 [vllm.py:689] Asynchronous scheduling is disabled.
[Stage-1] WARNING 03-16 10:52:15 [vllm.py:727] Enforce eager set, overriding optimization level to -O0
[Stage-1] INFO 03-16 10:52:15 [vllm.py:845] Cudagraph is disabled under eager mode
[Stage-0] INFO 03-16 10:52:15 [model.py:1549] Using max model len 4096
[Stage-0] INFO 03-16 10:52:15 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=512.
[Stage-0] INFO 03-16 10:52:15 [vllm.py:689] Asynchronous scheduling is disabled.
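
[Editor's annotation] For orientation, the two stages resolved above map onto ordinary vLLM engine settings. A rough standalone equivalent of the Stage-0 (talker) engine is sketched below; vllm-omni assembles these engines internally, so treat the explicit constructor kwargs as an approximation of the logged values rather than the actual launch path:

    from vllm import LLM

    # Stage-0 talker engine, approximated from the log above
    # ("Using max model len 4096", "max_num_batched_tokens=512"):
    talker = LLM(
        model="Qwen/Qwen3-TTS-12Hz-1.7B-Base",
        trust_remote_code=True,
        max_model_len=4096,
        enable_chunked_prefill=True,
        max_num_batched_tokens=512,
        dtype="bfloat16",
        seed=0,
    )
    # Stage-1 (Qwen3TTSCode2Wav) differs mainly in max_model_len=32768,
    # max_num_batched_tokens=8192, and enforce_eager=True (hence "-O0" above).
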
[Stage-1] WARNING 03-16 10:52:23 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(EngineCore_DP0 pid=2076625) [Stage-1] INFO 03-16 10:52:23 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='Qwen/Qwen3-TTS-12Hz-1.7B-Base', speculative_config=None, tokenizer='Qwen/Qwen3-TTS-12Hz-1.7B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-TTS-12Hz-1.7B-Base, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': [...], 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': [...], 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': [...], 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=2076625) [Stage-1] WARNING 03-16 10:52:23 [multiproc_executor.py:921] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
[Stage-0] WARNING 03-16 10:52:23 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(EngineCore_DP0 pid=2076628) [Stage-0] INFO 03-16 10:52:23 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='Qwen/Qwen3-TTS-12Hz-1.7B-Base', speculative_config=None, tokenizer='Qwen/Qwen3-TTS-12Hz-1.7B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-TTS-12Hz-1.7B-Base, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': [...], 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': [...], 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': [...], 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=2076628) [Stage-0] WARNING 03-16 10:52:23 [multiproc_executor.py:921] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
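
[Editor's annotation] The recurring "No Flash Attention backend found" warning from envs.py:94 is emitted by the vllm-omni wrapper's environment probe; note that Stage-0's core engine nevertheless selects FLASH_ATTN a few lines below, so the SDPA fallback appears to affect only the wrapper-side attention paths. A quick way to confirm whether the flash-attn package is importable at all (a sketch; the module name is the usual PyPI one):

    import importlib.util

    # If this prints False, the flash-attn wheel is not installed in this
    # environment, and any component probing for it falls back to PyTorch SDPA.
    print("flash_attn importable:", importlib.util.find_spec("flash_attn") is not None)
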
[Stage-1] WARNING 03-16 10:52:31 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-0] WARNING 03-16 10:52:31 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-1] INFO 03-16 10:52:31 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:50131 backend=nccl
[Stage-1] INFO 03-16 10:52:31 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
[Stage-0] INFO 03-16 10:52:31 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:58877 backend=nccl
[Stage-0] INFO 03-16 10:52:31 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
/bin/sh: 1: sox: not found
[2026-03-16 10:52:31] WARNING __init__.py:10: SoX could not be found! If you do not have SoX, proceed here: - - - http://sox.sourceforge.net/ - - - If you do (or think that you should) have SoX, double-check your path variables.
(Worker pid=2076773) [Stage-1] INFO 03-16 10:52:32 [gpu_model_runner.py:4124] Starting to load model Qwen/Qwen3-TTS-12Hz-1.7B-Base...
/bin/sh: 1: sox: not found
[2026-03-16 10:52:32] WARNING __init__.py:10: SoX could not be found! If you do not have SoX, proceed here: - - - http://sox.sourceforge.net/ - - - If you do (or think that you should) have SoX, double-check your path variables.
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:32 [gpu_model_runner.py:4124] Starting to load model Qwen/Qwen3-TTS-12Hz-1.7B-Base...
(Worker pid=2076773) [Stage-1] INFO 03-16 10:52:32 [default_loader.py:293] Loading weights took 8312069.88 seconds
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:33 [cuda.py:367] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=2076773) [Stage-1] INFO 03-16 10:52:33 [gpu_model_runner.py:4221] Model loading took 0.0 GiB memory and 0.002091 seconds
(Worker pid=2076773) [Stage-1] INFO 03-16 10:52:33 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
(Worker pid=2076782) [...]:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=2076782) [...]:1184: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:33 [vllm.py:689] Asynchronous scheduling is disabled.
(Worker pid=2076773) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=2076773) [2026-03-16 10:52:33] WARNING logging.py:328: `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=2076773) [Stage-1] INFO 03-16 10:52:33 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(Worker pid=2076773) [Stage-1] INFO 03-16 10:52:33 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:33 [weight_utils.py:579] No model.safetensors.index.json found in remote.
(Worker pid=2076782) Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[...]
[...] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 03-16 10:52:35 [omni_stage.py:794] Max batch size: 4
INFO 03-16 10:52:35 [omni.py:448] [Orchestrator] Stage-1 reported ready
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:41 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:41 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:41 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:41 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:41 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:42 [backends.py:916] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/575bed9f25/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:42 [backends.py:976] Dynamo bytecode transform time: 6.51 s
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:53 [backends.py:351] Cache the graph of compile range (1, 512) for later use
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:59 [backends.py:368] Compiling a graph for compile range (1, 512) takes 12.98 s
(Worker pid=2076782) [Stage-0] INFO 03-16 10:52:59 [monitor.py:34] torch.compile takes 19.49 s in total
(Worker pid=2076782) [Stage-0] INFO 03-16 10:53:00 [base.py:81] Available KV cache memory: 19.62 GiB (process-scoped)
(EngineCore_DP0 pid=2076628) [Stage-0] INFO 03-16 10:53:00 [kv_cache_utils.py:1307] GPU KV cache size: 183,648 tokens
(EngineCore_DP0 pid=2076628) [Stage-0] INFO 03-16 10:53:00 [kv_cache_utils.py:1312] Maximum concurrency for 4,096 tokens per request: 44.84x
(Worker pid=2076782) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/4 [00:00<?, ?it/s]
[...]
[Stage-0] ERROR 03-16 10:53:13 [omni_stage.py:1054]     parse_model_prompt(model_config, prompt) for prompt in prompts
[Stage-0] ERROR 03-16 10:53:13 [omni_stage.py:1054]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/renderers/inputs/preprocess.py", line 219, in parse_model_prompt
[Stage-0] ERROR 03-16 10:53:13 [omni_stage.py:1054]     return parse_dec_only_prompt(prompt)
[Stage-0] ERROR 03-16 10:53:13 [omni_stage.py:1054]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/renderers/inputs/preprocess.py", line 142, in parse_dec_only_prompt
[Stage-0] ERROR 03-16 10:53:13 [omni_stage.py:1054]     raise TypeError("Prompt dictionary must contain text, tokens, or embeddings")
[Stage-0] ERROR 03-16 10:53:13 [omni_stage.py:1054] TypeError: Prompt dictionary must contain text, tokens, or embeddings
ERROR 03-16 10:53:13 [omni.py:1007] [Orchestrator] Stage 0 error on request 0_b62fa383-86ba-4b4c-9d59-22931be4d204: Prompt dictionary must contain text, tokens, or embeddings
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [multiproc_executor.py:247] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [multiproc_executor.py:247] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
(EngineCore_DP0 pid=2076625) /usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
(EngineCore_DP0 pid=2076625)   warnings.warn('resource_tracker: process died unexpectedly, '
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008] Traceback (most recent call last):
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 999, in run_engine_core
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1024, in run_busy_loop
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]     self._process_input_queue()
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1046, in _process_input_queue
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]     self._handle_client_request(*req)
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1102, in _handle_client_request
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008]     raise RuntimeError("Executor failed.")
(EngineCore_DP0 pid=2076625) [Stage-1] ERROR 03-16 11:37:24 [core.py:1008] RuntimeError: Executor failed.
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_3ecf6650'
(EngineCore_DP0 pid=2076628) /usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
(EngineCore_DP0 pid=2076628)   warnings.warn('resource_tracker: process died unexpectedly, '
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008] Traceback (most recent call last):
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 999, in run_engine_core
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1024, in run_busy_loop
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]     self._process_input_queue()
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1046, in _process_input_queue
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]     self._handle_client_request(*req)
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1102, in _handle_client_request
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008]     raise RuntimeError("Executor failed.")
(EngineCore_DP0 pid=2076628) [Stage-0] ERROR 03-16 11:37:24 [core.py:1008] RuntimeError: Executor failed.
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_2063e1d8'
(EngineCore_DP0 pid=2076625) Process EngineCore_DP0:
(EngineCore_DP0 pid=2076625) Traceback (most recent call last):
(EngineCore_DP0 pid=2076625)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=2076625)     self.run()
(EngineCore_DP0 pid=2076625)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=2076625)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=2076625)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1010, in run_engine_core
(EngineCore_DP0 pid=2076625)     raise e
(EngineCore_DP0 pid=2076625)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 999, in run_engine_core
(EngineCore_DP0 pid=2076625)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=2076625)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1024, in run_busy_loop
(EngineCore_DP0 pid=2076625)     self._process_input_queue()
(EngineCore_DP0 pid=2076625)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1046, in _process_input_queue
(EngineCore_DP0 pid=2076625)     self._handle_client_request(*req)
(EngineCore_DP0 pid=2076625)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1102, in _handle_client_request
(EngineCore_DP0 pid=2076625)     raise RuntimeError("Executor failed.")
(EngineCore_DP0 pid=2076625) RuntimeError: Executor failed.
(EngineCore_DP0 pid=2076628) Process EngineCore_DP0:
(EngineCore_DP0 pid=2076628) Traceback (most recent call last):
(EngineCore_DP0 pid=2076628)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=2076628)     self.run()
(EngineCore_DP0 pid=2076628)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=2076628)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=2076628)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1010, in run_engine_core
(EngineCore_DP0 pid=2076628)     raise e
(EngineCore_DP0 pid=2076628)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 999, in run_engine_core
(EngineCore_DP0 pid=2076628)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=2076628)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1024, in run_busy_loop
(EngineCore_DP0 pid=2076628)     self._process_input_queue()
(EngineCore_DP0 pid=2076628)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1046, in _process_input_queue
(EngineCore_DP0 pid=2076628)     self._handle_client_request(*req)
(EngineCore_DP0 pid=2076628)   File "/home/ubuntu/vllm_env/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1102, in _handle_client_request
(EngineCore_DP0 pid=2076628)     raise RuntimeError("Executor failed.")
(EngineCore_DP0 pid=2076628) RuntimeError: Executor failed.
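
[Editor's annotation] The root failure is the 10:53:13 TypeError from parse_dec_only_prompt: the orchestrator handed Stage-0 a prompt dictionary carrying none of the keys vLLM's input parser accepts, and the executor crashes at 11:37:24 follow from it. For reference, the accepted dict shapes in plain vLLM are sketched below (the token ids are placeholders, and the embeddings variant additionally requires a vLLM build and flags that support prompt embeddings):

    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen3-TTS-12Hz-1.7B-Base", trust_remote_code=True)

    # Each of these passes the "text, tokens, or embeddings" check that failed above:
    text_prompt = {"prompt": "Hello, world."}               # text
    tokens_prompt = {"prompt_token_ids": [151644, 872]}     # tokens (placeholder ids)
    # embeddings variant: {"prompt_embeds": some_tensor}    # embeddings, where supported

    out = llm.generate(text_prompt, SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)

So the first thing to check is how the TTS request is constructed upstream, since the dict that reached Stage-0 contained none of these keys.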