# Benchmark Results

Dataset: `data/bench/maya_hf_hin_v2` (Hindi, 200 segments)
Metric: mean WER (lower is better); CER included for context.
Inference: vLLM audio transcription, Whisper skipped.

## How We Ran These Experiments
- Pipeline: preprocess -> MMS-LID -> ASR -> ROVER -> verify. Language routing is forced from the path (`/hin/`), so Voxtral receives `language=hi`.
- Final selection in the pipeline: `final` defaults to `rover.text` unless CTC scoring or explicit router preferences are enabled. We did not enable CTC or router overrides in these runs, so `final` == ROVER output.
- Verification (accept/human) only labels segments; it does not change the transcript.
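The selection rule amounts to a short precedence chain. A minimal sketch (the function name, argument names, and the relative precedence of router vs. CTC are assumptions, not the pipeline's actual API):

```python
def select_final(rover_text, ctc_choice=None, router_choice=None):
    """Final-transcript selection as described above: the ROVER output
    is used unless a CTC-scored choice or an explicit router preference
    is present. Router-over-CTC precedence is an assumption here."""
    return router_choice or ctc_choice or rover_text
```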

### ROVER / Ensemble Logic (inside pipeline)
- Implemented in `maya_transcribe/rover.py`:
  - Tokenize each hypothesis.
  - Choose a “centroid” hypothesis (lowest average normalized edit distance to others).
  - Align all hypotheses to the centroid.
  - Vote per token position; ties fall back to the centroid token.
  - Gap insertions are kept only if >=50% support.
- With two hypotheses, ROVER tends to pick the centroid transcript and keep only the gap insertions supported by both.
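The voting steps above can be sketched as follows. This is a simplified illustration using `difflib` for alignment, not the actual `maya_transcribe/rover.py` code; the >=50% gap-insertion rule is omitted for brevity:

```python
from collections import Counter
from difflib import SequenceMatcher

def norm_edit_distance(a, b):
    """Normalized edit distance proxy: 1 - difflib similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def rover_vote(hypotheses):
    """Centroid-based ROVER sketch: pick the hypothesis closest on
    average to the others, align the rest to it, vote per token slot,
    and fall back to the centroid token on ties."""
    toks = [h.split() for h in hypotheses]
    # Centroid = lowest mean normalized edit distance to all hypotheses.
    centroid = min(
        range(len(toks)),
        key=lambda i: sum(norm_edit_distance(toks[i], t) for t in toks) / len(toks),
    )
    ref = toks[centroid]
    # Per-position candidate lists, seeded with the centroid tokens.
    slots = [[tok] for tok in ref]
    for j, hyp in enumerate(toks):
        if j == centroid:
            continue
        for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
            if op in ("equal", "replace"):
                for k in range(i1, min(i2, i1 + (j2 - j1))):
                    slots[k].append(hyp[j1 + (k - i1)])
    out = []
    for k, slot in enumerate(slots):
        counts = Counter(slot)
        best, n = counts.most_common(1)[0]
        # Tie (or no strict majority over the centroid) -> centroid token.
        out.append(best if n > counts[ref[k]] else ref[k])
    return " ".join(out)
```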

### Voxtral + Gemma Offline Ensemble
- We merged `results.jsonl` from Voxtral and Gemma, then ran ROVER on their `final` texts only.
- No verification gating or CTC reranking in that offline ensemble; the output is purely ROVER.
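A sketch of that offline merge, assuming each `results.jsonl` row carries `id` and `final` fields (the field names are an assumption about the dump schema):

```python
import json

def load_finals(path):
    """Map segment id -> final transcript from a results.jsonl dump."""
    finals = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            finals[row["id"]] = row["final"]
    return finals

def merge_for_rover(path_a, path_b):
    """Pair up finals from two runs; segments missing from either
    dump are skipped rather than ensembled one-sided."""
    a, b = load_finals(path_a), load_finals(path_b)
    return {sid: (a[sid], b[sid]) for sid in a.keys() & b.keys()}
```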

### Prompts / Decoding
- Voxtral transcription mode: vLLM’s Voxtral `TranscriptionRequest` path (audio-only, no extra text prompt). Language is set to `hi` from path routing.
- Prompt.md override: when supplied, Voxtral uses a chat prompt with text + audio; Gemma uses a custom `<start_of_turn>` prompt with `<audio_soft_token>`. `<language>` is replaced with the model-supported language name (Hindi). Output is post-processed to strip the `Output 1: Transcript` header line.
- Gemma-3n transcription mode: vLLM’s built-in Gemma3n prompt template:
  - `"Transcribe this audio into Hindi: <audio_soft_token><end_of_turn>\n<start_of_turn>model\n"`
- Voxtral proofread prompt (audio + pass1 transcript, chat mode):
  - `"Correct transcription errors using the audio. Keep the same language/script. Do not add new content. Output only the corrected transcript.\n\nTranscript:\n{transcript}"`
- Decoding: `temperature=0.0`, `top_p=1.0`. For Gemma-3n we also add stop token `<end_of_turn>` and cap output length to 12 tokens/sec (min 32, max 1024).
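The Gemma-3n output cap described above reduces to a small clamp; a sketch with the same constants (the function name is illustrative):

```python
def dynamic_max_tokens(audio_seconds, rate=12, floor=32, ceil=1024):
    """Cap generation length proportionally to audio duration:
    12 tokens per second, clamped to [32, 1024] as described above."""
    return max(floor, min(ceil, int(rate * audio_seconds)))
```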

## Voxtral
- Voxtral-Small-24B-2507: WER 14.87%, CER 5.49% (best so far).
  - Eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm/eval.json`
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/voxtral_small_vllm --skip-whisper --vllm-asr-model mistralai/Voxtral-Small-24B-2507 --vllm-gpu-memory-utilization 0.9 --vllm-max-model-len 4096`
- Voxtral-Small-24B-2507 (2-pass proofread, audio + pass1 transcript via chat): final WER 16.62%, CER 7.32%. The raw pass stayed competitive; the proofread pass degraded.
  - Final eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread/eval.json`
  - Raw eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread/eval_raw.json` (WER 15.03%, CER 5.56%)
  - Proofread eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread/eval_proofread.json` (WER 24.58%, CER 17.59%)
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread --skip-whisper --vllm-asr-model mistralai/Voxtral-Small-24B-2507 --vllm-gpu-memory-utilization 0.9 --vllm-max-model-len 4096 --vllm-proofread`
- Voxtral-Mini-3B-2507: WER 19.32%, CER 7.85%.
  - Eval: `outputs/bench/maya_hf_hin_v2/voxtral_mini_vllm/eval.json`
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/voxtral_mini_vllm --skip-whisper --vllm-asr-model mistralai/Voxtral-Mini-3B-2507 --vllm-gpu-memory-utilization 0.15 --vllm-max-model-len 4096`

## Gemma-3n
- Gemma-3n-E2B-it (initial run): WER 137.6%, CER 105.2%. This run was dominated by runaway repetitions (hundreds to thousands of tokens) whenever the model failed to emit a stop token.
  - Eval: `outputs/bench/maya_hf_hin_v2/gemma3n_e2b_vllm/eval.json`
- Gemma-3n-E2B-it (tuned stop token + dynamic max_tokens): WER 20.63%, CER 10.05%. The runaway repetition is capped, but accuracy still lags Voxtral.
  - Eval: `outputs/bench/maya_hf_hin_v2/gemma3n_e2b_vllm_tuned/eval.json`
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/gemma3n_e2b_vllm_tuned --skip-whisper --vllm-asr-model google/gemma-3n-E2B-it --vllm-gpu-memory-utilization 0.9 --vllm-max-model-len 4096`

## Ensembles
- Voxtral-Small + Gemma-3n (ROVER on both finals): WER 20.81%, CER 11.04% (worse than Voxtral alone).
  - Eval: `outputs/bench/maya_hf_hin_v2/voxtral_gemma_ensemble/eval.json`

## Prompt.md Experiments (custom instruction prompt)
Prompt source: `Prompt.md` (strict verbatim transcription, `Output 1: Transcript` format). All runs below used the same Hindi bench (`data/bench/maya_hf_hin_v2`).

### Voxtral (Prompt.md)
- Voxtral-Small-24B-2507: WER 24.52%, CER 17.03% (worse than baseline).
  - Eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_prompt/eval.json`
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_prompt --skip-whisper --vllm-asr-model mistralai/Voxtral-Small-24B-2507 --vllm-gpu-memory-utilization 0.9 --vllm-max-model-len 4096 --vllm-transcribe-prompt-file Prompt.md`
- Voxtral-Small-24B-2507 (2-pass proofread + Prompt.md on pass1): final WER 28.50%, CER 21.95% (worse than baseline).
  - Final eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread_prompt/eval.json`
  - Raw eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread_prompt/eval_raw.json` (WER 25.65%, CER 18.16%)
  - Proofread eval: `outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread_prompt/eval_proofread.json` (WER 31.84%, CER 27.36%)
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/voxtral_small_vllm_proofread_prompt --skip-whisper --vllm-asr-model mistralai/Voxtral-Small-24B-2507 --vllm-gpu-memory-utilization 0.9 --vllm-max-model-len 4096 --vllm-proofread --vllm-transcribe-prompt-file Prompt.md`
- Voxtral-Mini-3B-2507: WER 98.49%, CER 99.88% (prompt caused severe degradation).
  - Eval: `outputs/bench/maya_hf_hin_v2/voxtral_mini_vllm_prompt/eval.json`
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/voxtral_mini_vllm_prompt --skip-whisper --vllm-asr-model mistralai/Voxtral-Mini-3B-2507 --vllm-gpu-memory-utilization 0.9 --vllm-max-model-len 4096 --vllm-transcribe-prompt-file Prompt.md`

### Gemma-3n (Prompt.md)
- Gemma-3n-E2B-it: WER 24.26%, CER 13.90% (worse than tuned Gemma with the built-in prompt).
  - Eval: `outputs/bench/maya_hf_hin_v2/gemma3n_e2b_vllm_prompt/eval.json`
  - Command: `python3 -m maya_transcribe.cli bench run --bench-dir data/bench/maya_hf_hin_v2 --out outputs/bench/maya_hf_hin_v2/gemma3n_e2b_vllm_prompt --skip-whisper --vllm-asr-model google/gemma-3n-E2B-it --vllm-gpu-memory-utilization 0.9 --vllm-max-model-len 4096 --vllm-transcribe-prompt-file Prompt.md`

### Ensemble (Prompt.md)
- Voxtral-Small + Gemma-3n (ROVER on both finals): WER 26.82%, CER 18.12% (worse than Voxtral prompt alone).
  - Eval: `outputs/bench/maya_hf_hin_v2/voxtral_gemma_ensemble_prompt/eval.json`
  - Results: `outputs/bench/maya_hf_hin_v2/voxtral_gemma_ensemble_prompt/results.jsonl`

## Notes
- Gemma-3n now uses a per-audio max_tokens cap (12 tokens/sec, max 1024) and a Gemma stop token (`<end_of_turn>`) to prevent runaway decoding.
- Best WER on this Hindi bench is currently Voxtral-Small (14.87%).

## Multilingual 100-sample Runs (Transcription Dumps)
Bench subset: `data/bench/maya_hf_v2_100` (12 languages x 100 samples). These runs include JSON dumps + eval.json for per-language WER.

- IndicConformer 600M multilingual (HF): `outputs/bench/maya_hf_v2_100/indicconformer_600m_hf/results.jsonl`
  - Note: English segments failed with `KeyError: 'joint_post_net_en'` (100 error rows).
- Whisper large-v3: `outputs/bench/maya_hf_v2_100/whisper_large_v3/results.jsonl`
- Voxtral-Small-24B-2507: `outputs/bench/maya_hf_v2_100/voxtral_small_vllm/results.jsonl`
  - Note: Voxtral only natively supports a small language set (hi/en/...) so non-supported languages likely degrade.
- Gemma-3n-E2B-it: `outputs/bench/maya_hf_v2_100/gemma3n_e2b_vllm/results.jsonl`
- OmniASR LLM 7B v2: `outputs/bench/maya_hf_v2_100/omniasr_llm_7b_v2/results.jsonl`
- Seamless M4T v2 large: `outputs/bench/maya_hf_v2_100/seamless_m4t_v2_large/results.jsonl`

### Per-language WER (mean, %)
Source: `eval.json` from each run. Note: IndicConformer HF failed on English (`n/a` below), so its overall WER is computed on 1100 segments.

| Lang | IndicConformer-600M (HF) | Whisper large-v3 | Voxtral-Small | Gemma-3n-E2B | OmniASR LLM 7B v2 | Seamless M4T v2 large |
| --- | --- | --- | --- | --- | --- | --- |
| asm | 19.9 | 95.9 | 107.9 | 78.0 | 21.8 | 29.9 |
| ben | 11.2 | 64.2 | 37.4 | 34.8 | 13.9 | 17.5 |
| eng | n/a | 4.5 | 3.8 | 9.2 | 6.0 | 7.7 |
| guj | 11.7 | 49.3 | 49.8 | 87.3 | 16.3 | 19.4 |
| hin | 9.2 | 30.9 | 14.9 | 18.8 | 12.6 | 25.4 |
| kan | 14.3 | 71.7 | 49.0 | 45.8 | 20.4 | 30.5 |
| mal | 29.9 | 122.5 | 62.6 | 65.1 | 41.2 | 42.6 |
| mar | 14.6 | 79.7 | 39.9 | 38.8 | 25.1 | 24.1 |
| ory | 16.3 | 115.5 | 130.3 | 120.0 | 29.8 | 31.5 |
| pan | 12.6 | 64.7 | 34.2 | 135.0 | 16.0 | 17.4 |
| tam | 20.1 | 50.9 | 52.7 | 78.8 | 32.9 | 31.5 |
| tel | 24.4 | 79.0 | 47.5 | 51.2 | 31.6 | 40.1 |
| overall | 16.7 | 69.1 | 52.5 | 63.6 | 22.3 | 26.5 | 

### Analysis: Why WER Is High (multilingual 100-sample)
- The WER calculation is standard word-level jiwer on normalized text (NFKC, punctuation stripped, digits normalized; only English lowercased). The normalization is not overly aggressive, and it is consistent with the reasonable scores seen for IndicConformer/OmniASR/M4T.
- The dominant failure mode is wrong script or wrong language output, which yields near-zero token overlap and WER > 100.
  - Whisper large-v3: Odia outputs are 100% non-Odia script (often Devanagari; sample labeled `language=ne`), Malayalam is ~80% Gurmukhi. Assamese stays in Bengali script but is effectively Bengali orthography; CER is ~51% while WER is ~96%, so word overlap is minimal even though characters look similar.
  - Voxtral: Odia is 100% Devanagari; Assamese is ~54% Devanagari. This aligns with Voxtral's limited language support and fallback to Hindi-style output.
  - Gemma-3n: Punjabi is ~91% Devanagari, Gujarati ~64% Devanagari, Odia ~87% non-Odia script. Outputs are often transliterations or Hindi-leaning paraphrases, not strict native-script transcripts.
- LLM-style models also add or paraphrase content. Gemma's Tamil hypotheses average ~1.46x the reference length (10% of them exceed 2x), which inflates insertions and pushes WER up even when the script matches.
- Models that preserve the expected script (IndicConformer, OmniASR, M4T) show consistent WERs (roughly 12-31), which suggests the evaluator is pairing refs/hyps correctly.
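For reference, the normalization-plus-WER computation described above can be sketched as follows. A pure-Python edit distance stands in for jiwer, and the exact normalization rules (e.g. how digit runs are collapsed) are assumptions:

```python
import re
import unicodedata

def normalize(text, lang="hin"):
    """NFKC-normalize, strip punctuation, collapse digit runs, and
    lowercase English only, roughly as described above."""
    text = unicodedata.normalize("NFKC", text)
    if lang == "eng":
        text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
    text = re.sub(r"\d+", "0", text)      # normalize digit runs (assumption)
    return " ".join(text.split())

def wer(ref, hyp):
    """Word-level edit distance over reference length (jiwer-equivalent),
    computed with a single rolling DP row."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)
```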

Suggested follow-ups:
- Report CER alongside WER for Bengali-script languages (Assamese/Bengali), and consider transliteration-normalized WER for script-mismatch cases.
- For Whisper, avoid auto-detect where possible (or explicitly set supported language tokens).
- For Voxtral/Gemma, enforce the supported language/script, or post-process via script conversion when benchmarking.
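The script-mismatch percentages quoted above can be approximated by bucketing letters on their Unicode character-name prefix; a sketch (the analysis methodology actually used is not specified, so this is an assumption):

```python
import unicodedata
from collections import Counter

def script_share(text):
    """Fraction of letters per Unicode script, using the first word of
    each character's Unicode name (e.g. 'DEVANAGARI', 'ORIYA',
    'BENGALI') as a cheap script proxy. Combining marks and
    non-letters are skipped."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0]] += 1
    total = sum(counts.values()) or 1
    return {script: n / total for script, n in counts.items()}
```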
