---
name: Validation Model Stack Decisions
overview: "Answering all architectural questions about the validation model stack, then proposing the optimal 4-model config: MMS LID + VoxLingua + IndicConformer (GPU-fixed) + wav2vec2-large English CTC, dropping MMS-1B-All."
todos:
  - id: pull-audio
    content: Copy disputed FLAC to project root, convert to WAV
    status: pending
  - id: gpu-preproc
    content: Rewrite IndicConformer TorchScript preprocessor as pure PyTorch mel-spectrogram (n_fft=512, hop=160, win=400, n_mels=80, preemph=0.97)
    status: pending
  - id: english-ctc
    content: Replace MMS-1B-All with wav2vec2-large-960h-lv60-self in wav2vec_lang.py, implement real CTC log-likelihood scoring
    status: pending
  - id: retest
    content: Full pipeline test on Hindi + Malayalam videos, verify GPU conformer + English CTC both work
    status: pending
isProject: false
---

# Validation Model Stack: Decisions and Optimizations

## Audio File Request

The disputed segment is at `/tmp/val_test_--BYhwbvDSM_op2u2y9m/--BYhwbvDSM/segments/SPEAKER_01_0002_110.60-113.06.flac`. Once you confirm the plan, I will copy it to project root and convert to WAV.

---

## Q1: Why was MMS-1B-All brought in?

**Only because IndicWav2Vec required gated HF access.** `ai4bharat/indicwav2vec-hindi` (and every per-language variant) is a gated repo that needs approval separate from the HF token. MMS-1B-All was a publicly available drop-in. It is **not essential** if we have proper Indic + English coverage elsewhere.

**Recommendation: Drop MMS-1B-All entirely.** It is 965M params (~1.9GB VRAM) and duplicates what IndicConformer already does for Indic. Replace it with a lightweight English-only CTC model.

---

## Q2: English CTC Model

The best candidate is **`facebook/wav2vec2-large-960h-lv60-self`**:

- **Params**: ~315M (vs 965M for MMS-1B-All)
- **VRAM**: ~0.6GB fp16
- **WER**: 1.9% clean / 3.9% other on LibriSpeech (state-of-the-art for wav2vec2)
- **Interface**: Standard `AutoModelForCTC` — direct logit access, proper tokenizer, actual CTC scoring possible (character-level vocab, not SentencePiece)
- **Speed**: ~50+ segs/s (wav2vec2 is very fast for English)
- **License**: Apache 2.0

Since English has a proper character-level tokenizer, we CAN do real CTC log-likelihood scoring (not just CER). This gives us P(Gemini's text | audio) directly, which is much more principled than CER.
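Concretely, `torch.nn.functional.ctc_loss` already computes the negative log of P(target | audio) summed over all valid CTC alignments, so the scoring step is short. A sketch, assuming the wav2vec2 logits have been log-softmaxed and Gemini's text tokenized with the model's character vocab (function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def ctc_log_likelihood(log_probs: torch.Tensor, target_ids: torch.Tensor,
                       blank_id: int = 0) -> float:
    """Length-normalized log P(target | audio) under a CTC head.

    log_probs:  (T, V) frame-level log-softmax output of the CTC model.
    target_ids: (U,)   tokenized reference text (Gemini's transcript).
    """
    T, _ = log_probs.shape
    U = target_ids.numel()
    # ctc_loss returns the *negative* log-likelihood of the target labeling,
    # summed over all valid alignments -- exactly -log P(text | audio).
    nll = F.ctc_loss(
        log_probs.unsqueeze(1),      # (T, N=1, V)
        target_ids.unsqueeze(0),     # (N=1, U)
        torch.tensor([T]),           # input lengths
        torch.tensor([U]),           # target lengths
        blank=blank_id, reduction="sum",
    )
    return (-nll / max(U, 1)).item()  # per-character log-likelihood
```

Length-normalizing by target characters keeps scores comparable across segments of different durations.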

**Revised model stack:**

```mermaid
flowchart LR
    subgraph lid [LID Models]
        MMS["MMS LID-256\n1B params, 256 langs\n~21 segs/s"]
        Vox["VoxLingua107\n14M params, 107 langs\n~78 segs/s\n+ speaker embeddings"]
    end
    subgraph ctc [CTC Models]
        Conformer["IndicConformer 600M\n22 Indic langs\n~6 segs/s with GPU fix"]
        W2V["wav2vec2-large-960h\n315M params, English only\n~50 segs/s"]
    end
    Audio --> lid
    Audio --> ctc
```

Total VRAM: ~1.9 GB (MMS LID) + ~0.1 GB (Vox) + ~0.5 GB (Conformer ONNX) + ~0.6 GB (wav2vec2) ≈ **3.1 GB**. Fits comfortably on a 3090 (24 GB).

---

## Q3: Do CTC models actually add value?

**Yes, but the value is nuanced.** Here is what each metric tells you:

### What LID alone gives you (fast, ~20 segs/s)
- Binary: "Is this segment in the expected language?" (yes/no)
- Catches wrong-language segments, code-mixed audio, background noise
- Does NOT tell you anything about transcription quality

### What CTC adds on top (slower, ~6 segs/s)
- **Independent transcription**: A second ASR model's take on what was said
- **CER score**: Continuous 0-1 measure of how much Gemini agrees with the CTC model
- **Greedy confidence**: How sure the CTC model is about its own output
- **Catches hallucinations**: Gemini can produce fluent-sounding text that is completely wrong. LID won't catch this (language is correct), but CTC will (its transcription will be totally different)
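The "greedy confidence" signal above is worth pinning down. One common definition (illustrative; the pipeline's actual column may be computed differently) is the mean per-frame probability the CTC head assigns to its own argmax token:

```python
import torch

def greedy_confidence(log_probs: torch.Tensor) -> float:
    """Mean probability of the argmax token per frame, over a segment.

    log_probs: (T, V) frame-level log-softmax output of the CTC model.
    Near 1.0 = the model is sure of its own greedy transcription;
    low values flag noisy audio or out-of-domain speech.
    """
    return log_probs.exp().max(dim=-1).values.mean().item()
```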

### The catch with CER
CER between Gemini and CTC conflates real errors with harmless differences:
- English words in Hindi text: Gemini writes "interview tips", Conformer writes "इंटरव्यू टिप्स" (transliterated) -- high CER but both correct
- Punctuation: Gemini adds commas/periods, CTC doesn't -- inflates CER
- Numbers: "50 meter" vs "पचास मीटर" -- format difference, not error

**But for data gathering, this is exactly what we want.** We are NOT classifying Golden/Redo/Discard right now. We are collecting raw signals. The CER, greedy confidence, and CTC transcription are all columns in the parquet that a later bucketing step can use with any thresholds.
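When that bucketing step arrives, a cheap way to separate real disagreement from the harmless classes above is to normalize both strings before computing CER: punctuation and casing wash out, while transliteration and number-format mismatches remain. A sketch with illustrative normalization rules (not what the pipeline currently does):

```python
import re
import unicodedata

def normalize_for_cer(text: str) -> str:
    """Strip the harmless difference classes: casing, punctuation,
    repeated whitespace. Unicode-aware, so Devanagari survives intact."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    h = list(hyp)
    d = list(range(len(h) + 1))               # single-row DP
    for i, rc in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hc in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rc != hc))
    return d[len(h)] / max(len(ref), 1)
```

Usage would be `cer(normalize_for_cer(gemini_text), normalize_for_cer(conformer_text))`, with thresholds chosen at bucketing time.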

### The compute cost question
- **LID-only pass**: ~11 hours on 100 GPUs, ~$165
- **Full pass (LID + CTC)**: ~39 hours on 100 GPUs (with GPU fix), ~$585
- **CTC costs 3.5x more** in time and money

**My take**: Run the full stack. At $585, the data richness is worth it. You get 27 columns per segment instead of ~10. And the CTC data is the only thing that validates the actual transcription text, not just the language.

---

## Q4: Why IndicConformer runs on CPU (and the fix)

### The problem
The IndicConformer model has an unusual architecture -- it is NOT a regular PyTorch model:

```
preprocessor.ts  (TorchScript)  →  Audio to Mel Spectrogram
encoder.onnx     (ONNX Runtime)  →  Mel features to encoder hidden states  
ctc_decoder.onnx (ONNX Runtime)  →  Hidden states to CTC logprobs
```

The **preprocessor is TorchScript** (`.ts` file loaded via `torch.jit.load`). When MMS LID (regular PyTorch on CUDA) and the TorchScript preprocessor both try to use CUDA, the TorchScript model's internal buffers conflict with PyTorch's CUDA allocations. Result: silent garbage output.

### What the preprocessor actually does
I inspected the TorchScript graph. It is a standard **NeMo AudioToMelSpectrogramPreprocessor**:

1. Pre-emphasis filter (coeff=0.97)
2. STFT: n_fft=512, hop_length=160, win_length=400
3. Power spectrum → mel filterbank (80 mels)
4. Log-mel + normalization

These are ~20 lines of standard `torchaudio` ops.

### The fix: rewrite preprocessor as pure PyTorch
Instead of loading the TorchScript `.ts` file, we rewrite the mel spectrogram computation as a regular PyTorch function using `torchaudio.transforms.MelSpectrogram`. This:

- Runs on CUDA alongside MMS/VoxLingua with zero conflicts
- Is actually faster than TorchScript (avoids CPU-GPU round-trips)
- Uses well-known parameters (all visible in the graph above)
- The ONNX encoder/decoder continue to run on GPU via CUDAExecutionProvider as before

**Estimated speedup: ~2.5-4x** (from ~2.4 segs/s to ~6-10 segs/s per GPU).

---

## Q5: Whisper-large-v3 and alternatives

### Whisper-large-v3 specs
- **Params**: 1.55B
- **VRAM**: ~3.1 GB fp16 -- fits on 3090 (24GB) and 4090 (24GB)
- **Architecture**: Encoder-decoder (NOT CTC) -- generates text autoregressively
- **Languages**: 99+ including English and many Indic

### Whisper variants

| Model | Params | VRAM fp16 | Speed | WER (en) |
|-------|--------|-----------|-------|----------|
| whisper-large-v3 | 1.55B | ~3.1 GB | 1x baseline | 2.7% |
| whisper-large-v3-turbo | 809M | ~1.6 GB | 8x faster | ~3.0% |
| distil-whisper-large-v3 | 756M | ~1.5 GB | 6x faster | ~2.8% |
| wav2vec2-large-960h-lv60-self | 315M | ~0.6 GB | ~20x faster | 1.9% |

### Does Whisper add value here?

**Not really, for this specific use case:**

1. **It is encoder-decoder, not CTC.** You cannot compute P(text|audio) efficiently. You can only generate a transcription and compare via CER -- same as what IndicConformer already does, but 10x slower because of autoregressive decoding.

2. **wav2vec2-large beats it for English.** 1.9% WER vs 2.7% WER, 20x faster, 5x less VRAM, AND you get real CTC logits for proper scoring.

3. **For Indic, IndicConformer is better.** It was trained specifically on Indian languages. Whisper's Indic performance varies wildly by language.

4. **The value Whisper WOULD add**: If you did not have IndicConformer at all, Whisper could serve as a universal "second opinion" for all languages. But you DO have IndicConformer, and it is better for Indic.

### Recommendation for English CTC

**Use `facebook/wav2vec2-large-960h-lv60-self`.** It is:
- Purpose-built for English CTC scoring (exactly what you need)
- Smallest VRAM footprint (0.6 GB)
- Fastest inference (~50 segs/s)
- Best WER (1.9%)
- Has a proper character-level tokenizer so we CAN do real CTC log-likelihood scoring (unlike the IndicConformer SentencePiece issue)

No need for Whisper. The stack is complete with IndicConformer (Indic) + wav2vec2-large (English).

---

## Q6: Realistic timelines with GPU fix

### Per-GPU throughput estimates

| Model | Current (CPU preproc) | After GPU fix |
|-------|----------------------|---------------|
| MMS LID-256 | 21 segs/s | 21 segs/s (unchanged) |
| VoxLingua107 | 78 segs/s | 78 segs/s (unchanged) |
| IndicConformer | **2.4 segs/s** | **~6-10 segs/s** |
| wav2vec2-large (English) | N/A (new) | ~50 segs/s |
| **Pipeline total** | **~2 segs/s** | **~5-8 segs/s** |

### 70M segments on 100 GPUs (RTX 3090 @ $0.15/hr)

| Scenario | Time | Cost |
|----------|------|------|
| LID only (no CTC) | ~11 hours | ~$165 |
| Full stack, CPU preprocessor (current) | ~97 hours (4 days) | ~$1,450 |
| **Full stack, GPU preprocessor (fixed)** | **~28-39 hours (1.2-1.6 days)** | **~$420-585** |
| Full stack, GPU fix + drop MMS-1B-All | ~24-35 hours (1-1.5 days) | ~$360-525 |
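The rows above all come from one back-of-envelope formula; a small helper to reproduce them or re-run with different rates (defaults are the values used in this section):

```python
def pass_cost(segments: int, segs_per_s_per_gpu: float,
              gpus: int = 100, usd_per_gpu_hr: float = 0.15) -> tuple[float, float]:
    """Wall-clock hours and total cost for one validation pass,
    assuming segments shard evenly across identical GPUs."""
    hours = segments / (segs_per_s_per_gpu * gpus) / 3600
    return hours, hours * gpus * usd_per_gpu_hr

# e.g. pass_cost(70_000_000, 2.0) -> (~97.2 h, ~$1,458): the current full-stack row
```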

---

## Implementation Plan

### Step 1: Pull and convert the audio file
Copy the disputed segment to project root, convert FLAC to WAV for listening.

### Step 2: Fix IndicConformer GPU preprocessor
Rewrite the TorchScript mel-spectrogram as pure PyTorch in [`validations/models/conformer_multi.py`](validations/models/conformer_multi.py). Parameters from the graph: n_fft=512, hop=160, win=400, n_mels=80, preemphasis=0.97.

### Step 3: Replace MMS-1B-All with wav2vec2-large English CTC
Rewrite [`validations/models/wav2vec_lang.py`](validations/models/wav2vec_lang.py) to use `facebook/wav2vec2-large-960h-lv60-self`. Since English has a character-level tokenizer, implement proper CTC log-likelihood scoring (not just CER).

### Step 4: Re-test full pipeline
Run on both the Hindi video (--46oQrkfig) and Malayalam video (--BYhwbvDSM) to verify:
- GPU conformer produces identical results to CPU version
- English CTC model produces valid scores for English segments
- Total per-video time is under 30s for ~50 segments
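For the first check, "identical" in practice means agreement within float32 tolerance (CUDA and CPU kernels order reductions differently), so the test is a max-abs-diff against a threshold. A tiny helper (the `1e-3` tolerance is a suggestion, not a measured bound):

```python
import torch

def max_feature_diff(feats_cpu: torch.Tensor, feats_gpu: torch.Tensor) -> float:
    """Largest absolute difference between the TorchScript (CPU) and
    pure-PyTorch (GPU) preprocessor outputs for the same segment."""
    assert feats_cpu.shape == feats_gpu.shape, "frame counts must match first"
    return (feats_cpu - feats_gpu.to(feats_cpu.device)).abs().max().item()

# accept the rewrite when max_feature_diff(ts_out, torch_out) < 1e-3
```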
