---
name: Verification Pipeline LID CTC
overview: Dual-model LID verification + CTC confidence scoring on 75M segments across 50-100 Vast.ai GPUs, producing Golden/Redo/Discard sets for the 489K target-language videos.
todos:
  - id: docker-verify
    content: Build verification Docker image with MMS LID-256 + VoxLingua107 + IndicConformer, test locally on 2-3 videos
    status: pending
  - id: deploy-verify
    content: Deploy to Vast.ai (50-100 GPUs), run verification on all 489K target-language videos
    status: pending
  - id: classify-sets
    content: Classify results into Golden/Redo/Discard sets, export final Parquet with tier labels
    status: pending
  - id: redo-pro
    content: Re-transcribe Redo set with Gemini Pro model, re-verify, merge into Golden or Discard
    status: pending
isProject: false
---

# Verification Pipeline: LID Consensus + CTC Confidence Scoring

## Goal

Run every segment through 2 LID models + 1 CTC confidence scorer to produce three output sets:

- **Golden Set**: LID consensus matches Gemini's detected language, CTC confidence is high. Ready for training.
- **Redo Set**: LID disagrees with Gemini OR CTC confidence is low. Re-transcribe with Gemini Pro.
- **Discard Set**: Both LID models disagree with Gemini AND CTC confidence is very low. Bad audio / wrong language / garbage.

## Architecture

```mermaid
flowchart TD
    R2[R2 transcribed tars] --> Worker
    Parquet[Local Parquet metadata] --> Worker
    
    subgraph Worker["GPU Worker (per video)"]
        Download[Download tar, extract FLACs] --> LID1["MMS LID-256 (wav2vec2, 1B params)"]
        Download --> LID2["VoxLingua107 ECAPA-TDNN (14M params)"]
        Download --> CTC["IndicConformer CTC (120M params)"]
        
        LID1 --> Consensus
        LID2 --> Consensus
        CTC --> Confidence["CTC confidence score"]
        
        Consensus --> Classify
        Confidence --> Classify
        
        Classify --> Golden[Golden]
        Classify --> Redo[Redo]
        Classify --> Discard[Discard]
    end
    
    Golden --> GoldenDB[(results parquet)]
    Redo --> RedoDB[(redo list)]
    Discard --> DiscardDB[(discard list)]
```

## Models

### 1. MMS LID-256 (Meta)

- **Architecture**: Wav2Vec2, 1B parameters
- **Languages**: 256 (covers all 12 Indic targets + English)
- **GPU memory**: ~4 GB (fp16)
- **Throughput estimate**: ~200-400 segments/min per GPU (5-10s audio, batched)
- **What it gives us**: Language classification probability distribution per segment

### 2. VoxLingua107 ECAPA-TDNN (SpeechBrain)

- **Architecture**: ECAPA-TDNN, ~14M parameters
- **Languages**: 107 (covers all targets)
- **GPU memory**: ~1 GB
- **Throughput estimate**: ~500-1000 segments/min per GPU (very lightweight)
- **What it gives us**: Second opinion on language + 256-dim speaker embedding (bonus: free speaker clustering data for TTS)

### 3. IndicConformer CTC (AI4Bharat)

- **Architecture**: Conformer-Large, 120M parameters, hybrid CTC-RNNT
- **Languages**: 22 Indic languages (per-language models, ~120M each)
- **GPU memory**: ~1 GB per language model
- **Throughput estimate**: ~100-200 segments/min per GPU
- **What it gives us**: CTC log-probabilities for Gemini's transcription text. High logprob = the audio matches the text. Low logprob = hallucination or wrong transcription.

All three models fit on a single RTX 3090 (24GB) simultaneously in fp16.
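As a sketch, batched LID scoring with the MMS model might look like the following. This assumes the public Hugging Face checkpoint `facebook/mms-lid-256` and a CUDA device; the batching and dtype choices are illustrative, not a finalized implementation:

```python
# Sketch: batched language ID with MMS LID-256 via Hugging Face Transformers.
# Checkpoint name is the public HF one; fp16-on-CUDA mirrors the deployment
# described above but is otherwise an assumption.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

MODEL_ID = "facebook/mms-lid-256"
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda").eval()

@torch.inference_mode()
def lid_batch(waveforms, sr=16_000):
    """Return (language, probability) for each 16 kHz mono waveform."""
    inputs = extractor(waveforms, sampling_rate=sr,
                       return_tensors="pt", padding=True)
    logits = model(inputs.input_values.to("cuda", torch.float16)).logits
    conf, idx = logits.softmax(dim=-1).max(dim=-1)
    return [(model.config.id2label[i.item()], c.item())
            for i, c in zip(idx, conf)]
```

The same loop shape works for VoxLingua107 via SpeechBrain; only the model loading differs.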

## CTC Confidence Scoring (the key insight)

Instead of full forced alignment (which needs good grapheme-to-phoneme models that don't exist for most Indic languages), we use CTC scoring:

1. Take Gemini's transcription text for a segment
2. Normalize it to the IndicConformer's expected input (the models are trained on native script, so this is usually just Unicode/text normalization, not romanization)
3. Run IndicConformer CTC forward pass on the audio
4. Compute the CTC log-likelihood of Gemini's text given the audio
5. High score = audio matches text = Gemini got it right
6. Low score = either hallucination, wrong language, or bad audio

This is NOT ASR (we don't re-transcribe). We just ask "how likely is this text given this audio?" -- a scoring operation, much faster than full transcription.
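The scoring step can be sketched with PyTorch's built-in CTC loss. The dummy tensors below stand in for the real IndicConformer encoder output and the tokenized Gemini text, and the length-normalization is our own convention, not something the doc prescribes:

```python
# Sketch: CTC confidence scoring. The negative of F.ctc_loss is the CTC
# log-likelihood of the target text given the model's frame-level log-probs.
import torch
import torch.nn.functional as F

def ctc_score(log_probs, target_ids, input_len, blank=0):
    """
    log_probs:  (T, V) log-softmax output of the CTC model for one segment.
    target_ids: 1-D tensor of token ids for Gemini's transcription.
    Returns the length-normalized CTC log-likelihood (higher = better match).
    """
    nll = F.ctc_loss(
        log_probs.unsqueeze(1),          # (T, N=1, V)
        target_ids.unsqueeze(0),         # (N=1, S)
        torch.tensor([input_len]),
        torch.tensor([len(target_ids)]),
        blank=blank,
        reduction="sum",
        zero_infinity=True,
    )
    return (-nll / max(len(target_ids), 1)).item()

# Toy example: 50 frames over a 32-token vocabulary (random stand-in logits).
lp = torch.randn(50, 32).log_softmax(dim=-1)
score = ctc_score(lp, torch.tensor([5, 9, 9, 17]), input_len=50)
```

A single forward pass plus this loss computation is all that's needed per segment, which is why scoring is much cheaper than decoding.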

**Why not full forced alignment**: Indic forced aligners (e.g. the Montreal Forced Aligner, MFA) have poor coverage for Tamil/Telugu/Kannada/etc. The phoneme dictionaries are incomplete. CTC scoring bypasses this entirely -- it works at the character/subword level directly.

**Why not romanization + forced alignment**: Romanization is lossy for Indic scripts (multiple characters map to same romanized form). CTC on native script is strictly better because IndicConformer was trained on native script.

## Classification Rules

For each segment, we have:

- `gemini_lang`: Gemini's detected language
- `mms_lang`: MMS LID-256 top prediction + confidence
- `vox_lang`: VoxLingua107 top prediction + confidence  
- `ctc_score`: IndicConformer CTC logprob for Gemini's text

### Golden Set (high confidence, ready for training)

All of these must hold:

- LID consensus: at least 2 of 3 (Gemini, MMS, VoxLingua) agree on language
- MMS confidence >= 0.7 OR VoxLingua confidence >= 0.7
- CTC confidence score above language-specific threshold (calibrated per language)
- Gemini quality_score >= 0.9

### Redo Set (uncertain, re-transcribe with Pro model)

Any of these:

- LID split: all 3 disagree (Gemini vs MMS vs VoxLingua)
- CTC confidence below threshold but LID agrees (text might be wrong but language is right)
- Gemini quality_score 0.5-0.9

### Discard Set (bad data)

Any of these:

- Both MMS and VoxLingua agree on a NON-target language (not in our 12 + English)
- CTC confidence extremely low AND LID disagrees with Gemini
- Audio is silence/noise (both LID models have very low confidence on any language)
- Gemini quality_score < 0.5
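The rules above can be collapsed into one pure function. The 0.7 / 0.5 / 0.9 thresholds are the ones stated above; `ctc_floor` ("extremely low") and `silence_conf` ("very low confidence") are illustrative cutoffs we picked for the sketch, and the per-language CTC threshold is passed in:

```python
# Sketch of the classification rules. `ctc_floor` and `silence_conf` are
# placeholder values, not calibrated numbers from this doc.
def classify(seg, target_langs, ctc_thresh, ctc_floor=-10.0, silence_conf=0.2):
    """Return 'golden', 'redo', or 'discard' for one segment dict."""
    g, m, v = seg["gemini_lang"], seg["mms_lang"], seg["vox_lang"]
    mc, vc = seg["mms_confidence"], seg["vox_confidence"]
    q, ctc = seg["quality_score"], seg["ctc_score"]

    # Discard: any hard-fail rule fires
    if m == v and m not in target_langs:
        return "discard"    # both LID models agree on a non-target language
    if ctc is not None and ctc < ctc_floor and m != g and v != g:
        return "discard"    # extremely low CTC AND LID disagrees with Gemini
    if mc < silence_conf and vc < silence_conf:
        return "discard"    # likely silence/noise
    if q < 0.5:
        return "discard"

    # Golden: every condition must hold
    agree = (g == m) or (g == v) or (m == v)   # at least 2 of 3 agree
    if (agree and (mc >= 0.7 or vc >= 0.7)
            and (ctc is None or ctc >= ctc_thresh) and q >= 0.9):
        return "golden"

    return "redo"           # everything uncertain goes to the Pro model
```

Ordering matters: discard rules are checked first so a garbage segment never lands in Redo and wastes a Gemini Pro call.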

## Deployment Plan (Vast.ai, 50-100 GPUs)

### Docker Image

Same pattern as transcription pipeline. Single container with:

- PyTorch + transformers + speechbrain + nemo
- All three model weights baked in (or downloaded on startup, ~6GB total)
- Worker claims videos from Supabase queue (reuse existing `video_queue` table with new status column or a new table)

### Per-Worker Flow

1. Claim a video from queue
2. Download the `_transcribed.tar` from R2 (already has polished FLACs)
3. Load all segment FLACs into memory
4. Batch through MMS LID-256 (all segments)
5. Batch through VoxLingua107 (all segments)
6. For segments where Gemini detected an Indic language: batch through IndicConformer CTC
7. Classify each segment into Golden/Redo/Discard
8. Write results to Supabase (new `verification_results` table) or upload as Parquet shard
9. Mark video as verified
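The steps above can be pinned down as a control-flow sketch. Every queue/R2/model helper is a hypothetical injected callable (none of these names exist yet); only the orchestration is fixed:

```python
# Sketch of the per-worker loop (steps 2-9; step 1, claiming from the queue,
# happens outside). All helpers are hypothetical stand-ins for the real
# Supabase/R2/model code.
def process_video(video_id, *, download, lid_mms, lid_vox, ctc_score_fn,
                  classify_fn, write_results, mark_verified):
    segments = download(video_id)                    # steps 2-3: tar -> FLACs
    audio = [s["audio"] for s in segments]
    mms = lid_mms(audio)                             # step 4: batched LID
    vox = lid_vox(audio)                             # step 5: batched LID
    results = []
    for seg, (m_lang, m_conf), (v_lang, v_conf) in zip(segments, mms, vox):
        # step 6: CTC scoring only where Gemini detected an Indic language
        ctc = ctc_score_fn(seg) if seg["gemini_lang"] != "en" else None
        rec = {
            "video_id": video_id,
            "segment_file": seg["file"],
            "gemini_lang": seg["gemini_lang"],
            "mms_lang": m_lang, "mms_confidence": m_conf,
            "vox_lang": v_lang, "vox_confidence": v_conf,
            "ctc_score": ctc,
            "quality_score": seg["quality_score"],
            "lid_consensus": len({seg["gemini_lang"], m_lang, v_lang}) < 3,
        }
        rec["tier"] = classify_fn(rec)               # step 7
        results.append(rec)
    write_results(results)                           # step 8
    mark_verified(video_id)                          # step 9
    return results
```

Injecting the helpers keeps the loop testable on a laptop with fakes before it ever touches a GPU box.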

### Throughput Estimate

- Bottleneck: MMS LID at ~300 segs/min/GPU
- 75M segments / 300 segs/min = 250K GPU-minutes = ~4,200 GPU-hours
- 50 GPUs: ~84 hours (~3.5 days)
- 100 GPUs: ~42 hours (~1.7 days)
- Cost: RTX 3090 at ~$0.15/hr x 4,200 GPU-hours = ~$630

### Bonus Output: Speaker Embeddings

VoxLingua107 ECAPA-TDNN produces 256-dim speaker embeddings as a byproduct. We save these -- they're free and invaluable for:

- Speaker diarization verification (are segments from the same speaker actually similar?)
- Speaker clustering for multi-speaker TTS training
- Deduplication (find segments from same speaker across videos)
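All three uses boil down to a cosine-similarity check over the saved embeddings. A minimal sketch in NumPy; the 0.75 threshold is a placeholder, not a calibrated value:

```python
# Sketch: similarity check over saved 256-dim embeddings. The threshold is
# an illustrative guess and would need calibration on held-out pairs.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

def same_speaker(emb_a, emb_b, threshold=0.75):
    """Diarization sanity check: do two segments plausibly share a speaker?"""
    return cosine_sim(emb_a, emb_b) >= threshold
```

Clustering for TTS and cross-video deduplication are the same operation at scale (e.g. approximate nearest-neighbor search over the embedding matrix).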

## Output Schema

New table/parquet: `verification_results`

- `video_id`, `segment_file`
- `mms_lang`, `mms_confidence` (top-1 language and probability)
- `vox_lang`, `vox_confidence`
- `ctc_score` (IndicConformer logprob, null for English segments)
- `lid_consensus` (boolean: at least 2/3 agree)
- `tier` (golden / redo / discard)
- `speaker_embedding` (256-dim float array from VoxLingua107)

## Expected Outcome

Based on the quality analysis we already did:

- **Golden**: ~65-70M segments (85-90%). The data is genuinely clean -- LID + CTC will confirm this.
- **Redo**: ~3-5M segments (5-7%). Edge cases where language is ambiguous or CTC suggests transcription issues.
- **Discard**: ~2-3M segments (3-5%). Non-target languages, silence, noise, garbage.

## What This Unlocks

After this pipeline completes, you have:

- **Per-segment verified language labels** (not YouTube metadata, not just Gemini -- triple-verified)
- **Transcription quality confidence** (CTC scoring catches hallucinations that quality_score alone misses)
- **Speaker embeddings** for every segment (free clustering for TTS)
- **A clean Redo list** to send to Gemini Pro for the ~5% that needs re-transcription
- **Complete trust in the Golden set** for training ASR and TTS models

