---
name: Validation Model Stack Decisions
overview: "Answering all architectural questions about the validation model stack, then proposing the optimal 4-model config: MMS LID + VoxLingua + IndicConformer (GPU-fixed) + wav2vec2-large English CTC, dropping MMS-1B-All."
todos:
  - id: pull-audio
    content: Copy disputed FLAC to project root, convert to WAV
    status: completed
  - id: gpu-preproc
    content: Rewrite IndicConformer TorchScript preprocessor as pure PyTorch mel-spectrogram (n_fft=512, hop=160, win=400, n_mels=80, preemph=0.97)
    status: completed
  - id: english-ctc
    content: Replace MMS-1B-All with wav2vec2-large-960h-lv60-self in wav2vec_lang.py, implement real CTC log-likelihood scoring
    status: completed
  - id: retest
    content: Full pipeline test on Hindi + Malayalam videos, verify GPU conformer + English CTC both work
    status: completed
  - id: todo-1772525047664-8ghmdml9z
    content: "Benchmark on "
    status: pending
isProject: false
---

# Validation Model Stack: Decisions and Optimizations

## Audio File Request

The disputed segment is at `/tmp/val_test_--BYhwbvDSM_op2u2y9m/--BYhwbvDSM/segments/SPEAKER_01_0002_110.60-113.06.flac`. Once you confirm the plan, I will copy it to project root and convert to WAV.

---

## Q1: Why was MMS-1B-All brought in?

**Only because IndicWav2Vec required gated HF access.** `ai4bharat/indicwav2vec-hindi` and all the per-language variants are gated repos that need separate approval beyond the HF token. MMS-1B-All was a publicly-available drop-in. It is **not essential** if we have proper Indic + English coverage elsewhere.

**Recommendation: Drop MMS-1B-All entirely.** It is 965M params (~1.9GB VRAM) and duplicates what IndicConformer already does for Indic. Replace it with a lightweight English-only CTC model.

---

## Q2: English CTC Model

The best candidate is **`facebook/wav2vec2-large-960h-lv60-self`**:

- **Params**: ~315M (vs 965M for MMS-1B-All)
- **VRAM**: ~0.6GB fp16
- **WER**: 1.9% clean / 3.9% other on LibriSpeech (state-of-the-art for wav2vec2)
- **Interface**: Standard `AutoModelForCTC` — direct logit access, proper tokenizer, actual CTC scoring possible (character-level vocab, not SentencePiece)
- **Speed**: ~50+ segs/s (wav2vec2 is very fast for English)
- **License**: Apache 2.0

Since English has a proper character-level tokenizer, we CAN do real CTC log-likelihood scoring (not just CER). This gives us P(Gemini's text | audio) directly, which is much more principled than CER.
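
Concretely, the scoring reduces to `torch.nn.functional.ctc_loss` over the model's frame-level log-probabilities. A minimal sketch with dummy logits (in the real pipeline the log-probabilities would come from the wav2vec2 CTC head and the target ids from its character tokenizer; the per-token normalization is a design choice, not part of CTC itself):

```python
import torch
import torch.nn.functional as F

def ctc_log_likelihood(log_probs: torch.Tensor, target_ids: torch.Tensor,
                       blank: int = 0) -> float:
    """log P(target | audio) under a CTC model, normalized per target token.

    log_probs:  (T, vocab) frame-level log-probabilities from the CTC head
    target_ids: (L,) token ids of the candidate transcription (Gemini's text)
    """
    T, L = log_probs.shape[0], target_ids.shape[0]
    # ctc_loss sums -log P over all alignments; negate to get the log-likelihood
    nll = F.ctc_loss(
        log_probs.unsqueeze(1),       # (T, batch=1, vocab)
        target_ids.unsqueeze(0),      # (batch=1, L)
        input_lengths=torch.tensor([T]),
        target_lengths=torch.tensor([L]),
        blank=blank, reduction="sum",
    )
    return (-nll / max(L, 1)).item()

# Dummy example: 50 frames, 30-symbol vocab (characters + blank), 8-token target
log_probs = torch.log_softmax(torch.randn(50, 30), dim=-1)
target = torch.randint(1, 30, (8,))
score = ctc_log_likelihood(log_probs, target)  # higher (closer to 0) = better match
```

Unlike CER, this marginalizes over all alignments, so it directly answers "how plausible is Gemini's text given the audio" rather than "how similar are two transcriptions".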

**Revised model stack:**

```mermaid
flowchart LR
    subgraph lid [LID Models]
        MMS["MMS LID-256\n1B params, 256 langs\n~21 segs/s"]
        Vox["VoxLingua107\n14M params, 107 langs\n~78 segs/s\n+ speaker embeddings"]
    end
    subgraph ctc [CTC Models]
        Conformer["IndicConformer 600M\n22 Indic langs\n~6 segs/s with GPU fix"]
        W2V["wav2vec2-large-960h\n315M params, English only\n~50 segs/s"]
    end
    Audio --> lid
    Audio --> ctc
```

Total VRAM: ~1.9 GB (MMS LID) + ~0.1 GB (Vox) + ~0.5 GB (Conformer ONNX) + ~0.6 GB (wav2vec2) = **~3.1 GB**. Fits comfortably on a 3090 (24GB).

---

## Q3: Do CTC models actually add value?

**Yes, but the value is nuanced.** Here is what each metric tells you:

### What LID alone gives you (fast, ~20 segs/s)

- Binary: "Is this segment in the expected language?" (yes/no)
- Catches wrong-language segments, code-mixed audio, background noise
- Does NOT tell you anything about transcription quality

### What CTC adds on top (slower, ~6 segs/s)

- **Independent transcription**: A second ASR model's take on what was said
- **CER score**: Continuous 0-1 measure of how much Gemini agrees with the CTC model
- **Greedy confidence**: How sure the CTC model is about its own output
- **Catches hallucinations**: Gemini can produce fluent-sounding text that is completely wrong. LID won't catch this (language is correct), but CTC will (its transcription will be totally different)

### The catch with CER

CER between Gemini and CTC conflates real errors with harmless differences:

- English words in Hindi text: Gemini writes "interview tips", Conformer writes "इंटरव्यू टिप्स" (transliterated) -- high CER but both correct
- Punctuation: Gemini adds commas/periods, CTC doesn't -- inflates CER
- Numbers: "50 meter" vs "पचास मीटर" -- format difference, not error

**But for data gathering, this is exactly what we want.** We are NOT classifying Golden/Redo/Discard right now. We are collecting raw signals. The CER, greedy confidence, and CTC transcription are all columns in the parquet that a later bucketing step can use with any thresholds.
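
For reference, the CER column is plain normalized edit distance; a self-contained sketch of the metric (the actual pipeline would more likely use a library such as `jiwer`):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance over characters
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)

# A single trailing period already moves CER off zero:
delta = cer("fifty meter", "fifty meter.")  # 1 edit / 11 reference chars
```

This is exactly why the raw value goes into the parquet unthresholded: harmless punctuation and transliteration differences are indistinguishable from small real errors at this layer.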

### The compute cost question

- **LID-only pass**: ~11 hours on 100 GPUs, ~$165
- **Full pass (LID + CTC)**: ~39 hours on 100 GPUs (with GPU fix), ~$585
- **CTC costs 3.5x more** in time and money

**My take**: Run the full stack. At $585, the data richness is worth it. You get 27 columns per segment instead of ~10. And the CTC data is the only thing that validates the actual transcription text, not just the language.

---

## Q4: Why IndicConformer runs on CPU (and the fix)

### The problem

The IndicConformer model has an unusual architecture -- it is NOT a regular PyTorch model:

```
preprocessor.ts  (TorchScript)  →  Audio to Mel Spectrogram
encoder.onnx     (ONNX Runtime)  →  Mel features to encoder hidden states  
ctc_decoder.onnx (ONNX Runtime)  →  Hidden states to CTC logprobs
```

The **preprocessor is TorchScript** (`.ts` file loaded via `torch.jit.load`). When MMS LID (regular PyTorch on CUDA) and the TorchScript preprocessor both try to use CUDA, the TorchScript model's internal buffers conflict with PyTorch's CUDA allocations. Result: silent garbage output.

### What the preprocessor actually does

I inspected the TorchScript graph. It is a standard **NeMo AudioToMelSpectrogramPreprocessor**:

1. Pre-emphasis filter (coeff=0.97)
2. STFT: n_fft=512, hop_length=160, win_length=400
3. Power spectrum → mel filterbank (80 mels)
4. Log-mel + normalization

These are ~20 lines of standard `torchaudio` ops.

### The fix: rewrite preprocessor as pure PyTorch

Instead of loading the TorchScript `.ts` file, we rewrite the mel spectrogram computation as a regular PyTorch function using `torchaudio.transforms.MelSpectrogram`. This:

- Runs on CUDA alongside MMS/VoxLingua with zero conflicts
- Is actually faster than TorchScript (avoids CPU-GPU round-trips)
- Uses well-known parameters (all visible in the graph above)
- The ONNX encoder/decoder continue to run on GPU via CUDAExecutionProvider as before

**Estimated speedup: 2-3x** (from ~2.4 segs/s to ~6-10 segs/s per GPU).

---

## Q5: Whisper-large-v3 and alternatives

### Whisper-large-v3 specs

- **Params**: 1.55B
- **VRAM**: ~3.1 GB fp16 -- fits on 3090 (24GB) and 4090 (24GB)
- **Architecture**: Encoder-decoder (NOT CTC) -- generates text autoregressively
- **Languages**: 99+ including English and many Indic

### Whisper variants


| Model                         | Params | VRAM fp16 | Speed       | WER (en) |
| ----------------------------- | ------ | --------- | ----------- | -------- |
| whisper-large-v3              | 1.55B  | ~3.1 GB   | 1x baseline | 2.7%     |
| whisper-large-v3-turbo        | 809M   | ~1.6 GB   | 8x faster   | ~3.0%    |
| distil-whisper-large-v3       | 756M   | ~1.5 GB   | 6x faster   | ~2.8%    |
| wav2vec2-large-960h-lv60-self | 315M   | ~0.6 GB   | ~20x faster | 1.9%     |


### Does Whisper add value here?

**Not really, for this specific use case:**

1. **It is encoder-decoder, not CTC.** You cannot compute P(text|audio) efficiently. You can only generate a transcription and compare via CER -- same as what IndicConformer already does, but 10x slower because of autoregressive decoding.
2. **wav2vec2-large beats it for English.** 1.9% WER vs 2.7% WER, 20x faster, 5x less VRAM, AND you get real CTC logits for proper scoring.
3. **For Indic, IndicConformer is better.** It was trained specifically on Indian languages. Whisper's Indic performance varies wildly by language.
4. **The value Whisper WOULD add**: If you did not have IndicConformer at all, Whisper could serve as a universal "second opinion" for all languages. But you DO have IndicConformer, and it is better for Indic.

### Recommendation for English CTC

**Use `facebook/wav2vec2-large-960h-lv60-self`.** It is:

- Purpose-built for English CTC scoring (exactly what you need)
- Smallest VRAM footprint (0.6 GB)
- Fastest inference (~50 segs/s)
- Best WER (1.9%)
- Has a proper character-level tokenizer so we CAN do real CTC log-likelihood scoring (unlike the IndicConformer SentencePiece issue)

No need for Whisper. The stack is complete with IndicConformer (Indic) + wav2vec2-large (English).

---

## Q6: Realistic timelines with GPU fix

### Per-GPU throughput estimates


| Model                    | Current (CPU preproc) | After GPU fix         |
| ------------------------ | --------------------- | --------------------- |
| MMS LID-256              | 21 segs/s             | 21 segs/s (unchanged) |
| VoxLingua107             | 78 segs/s             | 78 segs/s (unchanged) |
| IndicConformer           | **2.4 segs/s**        | **~6-10 segs/s**      |
| wav2vec2-large (English) | N/A (new)             | ~50 segs/s            |
| **Pipeline total**       | **~2 segs/s**         | **~5-8 segs/s**       |


### 70M segments on 100 GPUs (RTX 3090 @ $0.15/hr)


| Scenario                                 | Time                            | Cost          |
| ---------------------------------------- | ------------------------------- | ------------- |
| LID only (no CTC)                        | ~11 hours                       | ~$165         |
| Full stack, CPU preprocessor (current)   | ~97 hours (4 days)              | ~$1,450       |
| **Full stack, GPU preprocessor (fixed)** | **~28-39 hours (1.2-1.6 days)** | **~$420-585** |
| Full stack, GPU fix + drop MMS-1B-All    | ~24-35 hours (1-1.5 days)       | ~$360-525     |
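
The table's figures follow from simple throughput arithmetic; a quick sketch to reproduce them, with the segment count, GPU count, and hourly rate as stated above:

```python
# Assumed parameters from this section: 70M segments, 100 GPUs @ $0.15/hr
SEGMENTS = 70_000_000
GPUS, RATE = 100, 0.15

def estimate(segs_per_sec_per_gpu: float) -> tuple:
    """Return (wall-clock hours, total dollars) for one full pass."""
    hours = SEGMENTS / (segs_per_sec_per_gpu * GPUS) / 3600
    return hours, hours * GPUS * RATE

current = estimate(2)   # CPU preprocessor -> ~97 h, ~$1,458
fix_low = estimate(5)   # GPU fix, low end  -> ~39 h, ~$583
fix_high = estimate(8)  # GPU fix, high end -> ~24 h, ~$365
```

Note that cost scales only with total GPU-hours, so adding GPUs shortens wall-clock time without changing the dollar figure.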


---

## Implementation Plan

### Step 1: Pull and convert the audio file

Copy the disputed segment to project root, convert FLAC to WAV for listening.

### Step 2: Fix IndicConformer GPU preprocessor

Rewrite the TorchScript mel-spectrogram as pure PyTorch in `validations/models/conformer_multi.py`. Parameters from the graph: n_fft=512, hop=160, win=400, n_mels=80, preemphasis=0.97.

### Step 3: Replace MMS-1B-All with wav2vec2-large English CTC

Rewrite `validations/models/wav2vec_lang.py` to use `facebook/wav2vec2-large-960h-lv60-self`. Since English has a character-level tokenizer, implement proper CTC log-likelihood scoring (not just CER).

### Step 4: Re-test full pipeline

Run on both the Hindi video (--46oQrkfig) and Malayalam video (--BYhwbvDSM) to verify:

- GPU conformer produces identical results to CPU version
- English CTC model produces valid scores for English segments
- Total per-video time is under 30s for ~50 segments

