APEX FusedRMSNorm not available, using native implementation
/home/ubuntu/vibevoice/vibevoice/processor/vibevoice_asr_processor.py:23: UserWarning: audio_utils not available, will fall back to soundfile for audio loading
  warnings.warn("audio_utils not available, will fall back to soundfile for audio loading")
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'. 
The class this function is called from is 'VibeVoiceTextTokenizerFast'.
Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00,  2.02it/s]
Warming up baseline...

[1] Baseline: cfg=1.3, 20 steps
    RTF=1.169x  TTFB=164ms  Audio=12.67s

[2] cfg_scale=1.0, 20 steps
    RTF=1.162x  TTFB=166ms  Audio=12.80s

[3] cfg=1.3, 10 steps
    RTF=0.955x  TTFB=140ms  Audio=11.60s

[4] cfg=1.0, 10 steps
    RTF=0.951x  TTFB=140ms  Audio=10.93s

[5] torch.compile LM + cfg=1.3, 20 steps
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0] Graph break from `Tensor.item()`, consider setting:
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]     torch._dynamo.config.capture_scalar_outputs = True
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0] or:
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]     env TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0] to include these operations in the captured graph.
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0] 
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0] Graph break: from user code at:
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]   File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/utils/generic.py", line 965, in wrapper
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]     output = func(self, *args, **kwargs)
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]   File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 519, in forward
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]     causal_mask = self._update_causal_mask(
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]   File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 589, in _update_causal_mask
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0]     is_padding_right = attention_mask[:, -1].sum().item() != input_tensor.size()[0]
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0] 
W0314 15:52:26.156000 1841294 torch/_dynamo/variables/tensor.py:776] [0/0] 
/home/ubuntu/.local/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:167: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
W0314 15:52:37.446000 1841294 torch/_dynamo/convert_frame.py:844] [7/8] torch._dynamo hit config.cache_size_limit (8)
W0314 15:52:37.446000 1841294 torch/_dynamo/convert_frame.py:844] [7/8]    function: 'forward' (/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:152)
W0314 15:52:37.446000 1841294 torch/_dynamo/convert_frame.py:844] [7/8]    last reason: 7/0: L['self'].layer_idx == 0
W0314 15:52:37.446000 1841294 torch/_dynamo/convert_frame.py:844] [7/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0314 15:52:37.446000 1841294 torch/_dynamo/convert_frame.py:844] [7/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0314 15:52:41.587000 1841294 torch/_dynamo/convert_frame.py:844] [6/8] torch._dynamo hit config.cache_size_limit (8)
W0314 15:52:41.587000 1841294 torch/_dynamo/convert_frame.py:844] [6/8]    function: 'forward' (/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:245)
W0314 15:52:41.587000 1841294 torch/_dynamo/convert_frame.py:844] [6/8]    last reason: 6/0: len(L['kwargs']) == 5
W0314 15:52:41.587000 1841294 torch/_dynamo/convert_frame.py:844] [6/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0314 15:52:41.587000 1841294 torch/_dynamo/convert_frame.py:844] [6/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
Traceback (most recent call last):
  File "/home/ubuntu/vibevoice/rtf_optimize.py", line 153, in <module>
    main()
  File "/home/ubuntu/vibevoice/rtf_optimize.py", line 116, in main
    _ = measure(model, processor, voice_path, "Speaker 1: warmup test.", ddpm_steps=20, cfg_scale=1.3)
  File "/home/ubuntu/vibevoice/rtf_optimize.py", line 50, in measure
    outputs = model.generate(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/vibevoice/vibevoice/modular/modeling_vibevoice_inference.py", line 628, in generate
    positive_condition = outputs.last_hidden_state[diffusion_indices, -1, :]
RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 259, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 225, in forward
    return self.weight * hidden_states.to(input_dtype). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.