APEX FusedRMSNorm not available, using native implementation
/home/ubuntu/vibevoice/vibevoice/processor/vibevoice_asr_processor.py:23: UserWarning: audio_utils not available, will fall back to soundfile for audio loading
  warnings.warn("audio_utils not available, will fall back to soundfile for audio loading")
03/16/2026 06:47:26 - INFO - __main__ - Training/evaluation parameters CustomTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
ce_loss_weight=0.04,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
ddpm_batch_mul=4,
debug=[],
debug_ce_details=False,
debug_ce_every_n_steps=200,
debug_ce_max_examples=1,
debug_ce_topk=5,
debug_save=False,
deepspeed=None,
diffusion_loss_weight=1.4,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
gradient_clipping=True,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=None,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2.5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/ubuntu/vibevoice_finetune_output/runs/Mar16_06-47-25_0321-dsm2-nvdgxa100-prxmx70052,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=0.8,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=10.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=/home/ubuntu/vibevoice_finetune_output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=/home/ubuntu/vibevoice_finetune_output,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=steps,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tp_size=0,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
)
03/16/2026 06:47:26 - INFO - __main__ - Gradient clipping enabled: max_grad_norm=0.8
03/16/2026 06:47:26 - INFO - vibevoice.processor.vibevoice_processor - Loading tokenizer from Qwen/Qwen2.5-1.5B
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'. The class this function is called from is 'VibeVoiceTextTokenizerFast'.
03/16/2026 06:47:26 - WARNING - transformers.tokenization_utils_base - The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'. The class this function is called from is 'VibeVoiceTextTokenizerFast'.
Tied input and output embeddings using standard assignment.
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
… -> shared_params=True, values_equal=True, tie_word_embeddings=True
03/16/2026 06:47:29 - INFO - __main__ - LM head requires_grad before freeze: True
03/16/2026 06:47:29 - INFO - __main__ - Special token check -> speech_start_id=151652, decoded='<|vision_start|>', exists=True, in_vocab_range=True, emb_vs_head_row_equal=True
03/16/2026 06:47:29 - INFO - __main__ - Special token check -> speech_diffusion_id=151654, decoded='<|vision_pad|>', exists=True, in_vocab_range=True, emb_vs_head_row_equal=True
03/16/2026 06:47:29 - INFO - __main__ - Special token check -> speech_end_id=151653, decoded='<|vision_end|>', exists=True, in_vocab_range=True, emb_vs_head_row_equal=True
03/16/2026 06:47:29 - INFO - __main__ - === TOKENIZER DIAGNOSTICS ===
03/16/2026 06:47:29 - INFO - __main__ - Tokenizer class: VibeVoiceTextTokenizerFast
03/16/2026 06:47:29 - INFO - __main__ - Tokenizer vocab_size: 151643
03/16/2026 06:47:30 - INFO - __main__ - Simple text CE loss: 14.8125
Tied input and output embeddings using standard assignment.
03/16/2026 06:47:30 - INFO - __main__ - Trainable by block -> LLM-LoRA: 9,232,384 | diff_head: 123,279,360 | ac_conn: 0 | se_conn: 0
03/16/2026 06:47:30 - INFO - __main__ - TOTAL trainable: 132,511,744
03/16/2026 06:47:31 - INFO - __main__ - LoRA debug: found 392 LoRA params (A=196, B=196); trainable=392. Initial lora_B_zero=196.
  0%|          | 0/730 [00:00<?, ?it/s]
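A few numbers in this log are worth cross-checking against each other. With per_device_train_batch_size=8, gradient_accumulation_steps=4, and a single GPU (_n_gpu=1), each optimizer step consumes 32 samples, and the 730-step progress bar over num_train_epochs=10.0 implies 73 steps per epoch. A quick back-of-the-envelope check (the implied dataset size is approximate, since dataloader_drop_last=False lets the last step be partial):

```python
# Sanity arithmetic on the run configuration above.
per_device_train_batch_size = 8   # from the args dump
gradient_accumulation_steps = 4   # from the args dump
num_gpus = 1                      # _n_gpu=1
num_train_epochs = 10             # num_train_epochs=10.0
total_steps = 730                 # from the progress bar "0/730"

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = total_steps // num_train_epochs

print(effective_batch)                    # 32 samples per optimizer step
print(steps_per_epoch)                    # 73 optimizer steps per epoch
print(steps_per_epoch * effective_batch)  # ~2336 samples per epoch (upper bound)
```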
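The "Tied input and output embeddings" message and the shared_params / values_equal / emb_vs_head_row_equal fields can be reproduced with the standard transformers accessors. A minimal sketch follows; check_tie and check_special_token are illustrative helpers, not functions from the repo, and the training script's own diagnostic code may differ:

```python
import torch

def check_tie(model):
    # shared_params: do the embedding matrix and LM head point at the same
    # storage? values_equal is the weaker check that the values merely match.
    emb = model.get_input_embeddings().weight
    head = model.get_output_embeddings().weight
    shared_params = emb.data_ptr() == head.data_ptr()
    values_equal = torch.equal(emb, head)
    return shared_params, values_equal, model.config.tie_word_embeddings

def check_special_token(model, tokenizer, token_id):
    # Mirrors the per-token fields in the "Special token check" lines.
    decoded = tokenizer.decode([token_id])
    in_vocab_range = token_id < model.get_input_embeddings().weight.shape[0]
    row_equal = torch.equal(
        model.get_input_embeddings().weight[token_id],
        model.get_output_embeddings().weight[token_id],
    )
    return decoded, in_vocab_range, row_equal
```

The decoded strings also show that the speech markers are mapped onto Qwen's existing <|vision_start|> / <|vision_pad|> / <|vision_end|> token ids, which is why speech tokens decode to vision-related text.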
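For scale on the "Simple text CE loss: 14.8125" line: a model predicting uniformly over the 151,643-token vocabulary would score ln(V) ≈ 11.93 nats, a useful reference point when watching this number fall during training:

```python
import math

vocab_size = 151643                 # from the tokenizer diagnostics above
uniform_ce = math.log(vocab_size)   # cross-entropy of a uniform prediction
print(f"{uniform_ce:.2f}")          # ≈ 11.93 nats
```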
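The trainable-parameter and LoRA tallies at the end can be reproduced by walking named_parameters(). A sketch assuming PEFT's standard lora_A / lora_B parameter naming (count_trainable and lora_stats are illustrative names; the script's internal bookkeeping may differ):

```python
def count_trainable(model):
    # Matches the "TOTAL trainable" line: elements across all parameters
    # that still require gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def lora_stats(model):
    # Matches the "LoRA debug" line: tensor counts for A/B matrices and how
    # many B matrices are still at their zero initialization (PEFT zeros
    # lora_B so each adapter starts as a no-op).
    n_a = n_b = n_trainable = n_b_zero = 0
    for name, p in model.named_parameters():
        if "lora_A" in name:
            n_a += 1
        elif "lora_B" in name:
            n_b += 1
            if bool((p == 0).all()):
                n_b_zero += 1
        else:
            continue
        n_trainable += int(p.requires_grad)
    return n_a, n_b, n_trainable, n_b_zero
```

With A=196 and B=196 the reported total of 392 checks out, and lora_B_zero=196 confirms every adapter is still an identity transform before step 0.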