COMPARISON: VibeVoice vs Fish S2 Pro vs Sooktam-2
Reference audio: /home/ubuntu/vibevoice/demo/voices/modi.wav
Language: Hindi

============================================================
SOOKTAM-2 (BharatGen)
============================================================
Running setup-cls.sh...
  setup-cls.sh stderr: E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?

Loading Sooktam-2 model...
Encountered exception while importing f5_tts: No module named 'f5_tts'
  Sooktam-2 failed: This modeling file requires the following packages that were not found in your environment: f5_tts. Run `pip install f5_tts`
Traceback (most recent call last):
  File "/home/ubuntu/compare_tts.py", line 81, in test_sooktam2
    model = AutoModel.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 549, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1346, in from_pretrained
    config_class = get_class_from_dynamic_module(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 604, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 427, in get_cached_module_file
    modules_needed = check_imports(resolved_module_file)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 260, in check_imports
    raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: f5_tts. Run `pip install f5_tts`

============================================================
VIBEVOICE 1.5B (our current best)
============================================================
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773512317.647956 1891727 cpu_feature_guard.cc:227] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/ubuntu/.local/lib/python3.10/site-packages/google/api_core/_python_version_support.py:275: FutureWarning: You are using a Python version (3.10.12) which Google will stop supporting in new releases of google.api_core once it reaches its end of life (2026-10-04). Please upgrade to the latest Python version, or at least Python 3.11, to continue receiving updates for google.api_core past that date.
  warnings.warn(message, FutureWarning)
APEX FusedRMSNorm not available, using native implementation
/home/ubuntu/vibevoice/vibevoice/processor/vibevoice_asr_processor.py:23: UserWarning: audio_utils not available, will fall back to soundfile for audio loading
  warnings.warn("audio_utils not available, will fall back to soundfile for audio loading")
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'. 
The class this function is called from is 'VibeVoiceTextTokenizerFast'.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00,  2.44it/s]
  VibeVoice failed: GenerationMixin._prepare_cache_for_generation() takes 6 positional arguments but 7 were given
Traceback (most recent call last):
  File "/home/ubuntu/compare_tts.py", line 135, in test_vibevoice
    _ = model.generate(**inp, max_new_tokens=None, cfg_scale=1.3, tokenizer=processor.tokenizer,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/vibevoice/vibevoice/modular/modeling_vibevoice_inference.py", line 375, in generate
    generation_config, model_kwargs, input_ids, logits_processor, stopping_criteria = self._build_generate_config_model_kwargs(
  File "/home/ubuntu/vibevoice/vibevoice/modular/modeling_vibevoice_inference.py", line 303, in _build_generate_config_model_kwargs
    self._prepare_cache_for_generation(generation_config, model_kwargs, None, batch_size, max_cache_length, device)
TypeError: GenerationMixin._prepare_cache_for_generation() takes 6 positional arguments but 7 were given

============================================================
FISH AUDIO S2 PRO
============================================================
  Fish S2 Pro failed: No module named 'fish_speech.inference'
  Trying alternative approach...
  Alternative also failed: cannot import name 'launch' from 'fish_speech.models.text2semantic.inference' (/home/ubuntu/.local/lib/python3.10/site-packages/fish_speech/models/text2semantic/inference.py)

============================================================
All samples saved in /home/ubuntu/comparison_samples/
============================================================