---
name: System Rating and Gaps
overview: Honest assessment of the transcription + validation system for large-scale Indic TTS/ASR training, including scoring strategy recommendation and identification of critical gaps.
todos:
  - id: integrate-mms
    content: Integrate romanized MMS alignment into pipeline.py alongside existing CTC validation
    status: completed
  - id: tiered-scoring
    content: "Implement weighted scoring: S = 0.45*N + 0.55*R - 0.10*abs(N-R), with floor check min(N,R) >= 0.40, disagreement flag at abs > 0.25, and accept/review/retry/reject tiers"
    status: completed
  - id: retry-chain
    content: "Implement retry logic: flash -> pro -> reject. Wire into pipeline validation_action='retry'"
    status: completed
  - id: ctc-models
    content: Add CTC model definitions for remaining languages (Gujarati, Marathi, Punjabi, Odia, Assamese, English). Fix Bengali model (currently points to Hindi)
    status: in_progress
  - id: char-validation
    content: Add Unicode range definitions for missing languages in simple_validator.py (Gujarati, Marathi, Punjabi, Odia, Assamese)
    status: pending
  - id: audio-quality-gate
    content: "Add pre-transcription SNR gate: skip segments below threshold to avoid wasting API credits on noise"
    status: completed
  - id: golden-set
    content: "Create golden test set: 30-50 manually verified segments per priority language for threshold calibration"
    status: pending
isProject: false
---

# Transcription System Assessment: Honest Rating and Critical Gaps

## Rating: 6/10 (Telugu) / 4/10 (Pan-Indic)

A decent bootstrap pipeline. Not yet "large TTS/ASR training-grade" on its own. The transcription engine is solid; the validation infrastructure has significant holes. Sufficient as a first-pass data factory if you add stricter calibration, retry logic, and audit loops.

---

## What earns the 6

**Transcription quality (8/10 on its own):**

- 100% deterministic across 3 runs on 10 segments. This is rare.
- v3 prompt fixed real bugs (translation vs transliteration), has proper anti-hallucination guards.
- 12-language template with per-language script tips. Maintainable.
- Gemini 3 Flash is genuinely strong on Indic audio. Good model choice.

**Architecture (7/10):**

- End-to-end pipeline wired: R2 download → Supabase lookup → audio processing → polishing → transcription → validation → save.
- Audio polisher handles boundary artifacts (proven useful).
- Dual validation concept (native CTC + romanized MMS) is the right idea.

**What's actually proven:** Tested on 10 Telugu segments, 3 runs each. That's it. Everything else rests on untested assumptions.

---

## What prevents a higher rating

### Critical Gap 1: Validation only works for Telugu (Impact: HIGH)

- **CTC models**: Only Telugu (`wav2vec2-large-xlsr-53-telugu`) is cached. 6 more are defined in code but never downloaded. Bengali's model is wrong (points to Hindi). 5 languages (Gujarati, Marathi, Punjabi, Odia, Assamese) have NO CTC model defined at all.
- **Character validation**: Only 6 of 12 languages have Unicode range definitions in `simple_validator.py`.
- **Romanized MMS**: Not integrated into the pipeline. We ran it ad-hoc in the consistency test but `pipeline.py` doesn't call it.
- **Reality**: If you run this on Hindi tomorrow, transcription works but validation is blind.
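
Filling the character-validation gap is mostly table-filling. A sketch of what `simple_validator.py` would need, using the standard Unicode blocks for each missing script (dict layout and helper name are illustrative, not the current code):

```python
# Standard Unicode blocks for the scripts currently missing from
# simple_validator.py (names and layout are illustrative).
INDIC_BLOCKS = {
    "gujarati": (0x0A80, 0x0AFF),
    "punjabi":  (0x0A00, 0x0A7F),  # Gurmukhi
    "odia":     (0x0B00, 0x0B7F),
    "marathi":  (0x0900, 0x097F),  # Devanagari, shared with Hindi
    "assamese": (0x0980, 0x09FF),  # Bengali block, shared script
}

def in_script(char: str, language: str) -> bool:
    """True if the character falls inside the language's Unicode block."""
    lo, hi = INDIC_BLOCKS[language]
    return lo <= ord(char) <= hi
```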

### Critical Gap 2: No ground truth (Impact: HIGH)

CTC alignment scores are phonetic alignment quality, NOT accuracy metrics. A perfectly aligned hallucination scores high. Without ground truth:

- You cannot measure actual WER/CER.
- You cannot calibrate thresholds. The current 0.50 "reject" threshold was not derived from data — it's a guess.
- You don't know if 0.72 avg CTC actually means "good enough for TTS training."

**Minimum viable**: 30-50 manually verified segments per language (at least for the top 3-4 priority languages).
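
Once that set exists, thresholds can be calibrated against real CER instead of guesses. A self-contained CER implementation (single-row Levenshtein) of the kind a calibration script would need:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))          # dp[j] = distance(ref[:0], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                 # deletion
                        dp[j - 1] + 1,             # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))
            prev = cur
    return dp[n] / max(m, 1)
```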

### Critical Gap 3: Retry logic is dead code (Impact: MEDIUM)

`PipelineConfig.validation_action = "retry"` exists as a config field. Zero implementation behind it. No model fallback chain (flash → pro). No re-queue for rejected segments.

### Critical Gap 4: No audio quality gating (Impact: MEDIUM)

- SNR is measured by the polisher but **never used for rejection**. Every segment goes to Gemini regardless of audio quality — wasting API credits on noise/music segments.
- PANN (music/noise detection) is in `venv` dependencies but **not integrated anywhere in the pipeline**.
- No pre-transcription filter. Segments that are pure noise, music, or heavily degraded audio still get transcribed and validated, producing garbage data.
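
A minimal sketch of the gate, reusing the SNR the polisher already measures (function names and the 10 dB default are placeholders, to be calibrated on the golden set):

```python
# Pre-transcription gate on the SNR the polisher already measures.
# The 10 dB default is a placeholder until calibrated.
SNR_THRESHOLD_DB = 10.0

def passes_snr_gate(snr_db: float, threshold_db: float = SNR_THRESHOLD_DB) -> bool:
    return snr_db >= threshold_db

def filter_segments(segments, get_snr, threshold_db=SNR_THRESHOLD_DB):
    """Split segments into (send-to-Gemini, skipped) before spending credits."""
    keep, skipped = [], []
    for seg in segments:
        (keep if passes_snr_gate(get_snr(seg), threshold_db) else skipped).append(seg)
    return keep, skipped
```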

### Critical Gap 5: Code-mixing scoring is unaddressed (Impact: MEDIUM)

Our test data proved that native CTC scores poorly on code-mixed segments (segment 0004: CTC=0.48 vs MMS=0.77). With no code-mix-aware scoring, every code-mixed segment in a large run would either:

- Get rejected (wasting good data), or
- Get flagged for manual review (doesn't scale)

Indic podcast audio is 20-40% code-mixed. This is not an edge case.
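
A cheap first step is computing a code-mix ratio per segment so scoring can branch on it. A sketch that uses Latin-script character share as a proxy for the English fraction (a heuristic, not in the current pipeline):

```python
import unicodedata

def code_mix_ratio(text: str) -> float:
    """Share of alphabetic characters that are Latin-script -- a cheap proxy
    for the English fraction of a code-mixed transcription."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for c in letters if "LATIN" in unicodedata.name(c, ""))
    return latin / len(letters)
```

The scorer could then, for example, lean harder on MMS above ~0.5 instead of rejecting outright.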

---

## Scoring Strategy Recommendation

### What each score actually measures

- **Native CTC** (language-specific wav2vec2): Does the native script phonetically match the audio? Good at native words, bad at English words transliterated into native script.
- **Romanized MMS** (language-agnostic torchaudio): Does the Latin transliteration phonetically match the audio? Good at everything, but depends on Gemini's romanization quality.

### From our test data

```
Segment  | Native CTC | Roman MMS | Code-mix
---------|-----------|-----------|----------
0000     | 0.72      | 0.62      | Medium
0001     | 0.70      | 0.80      | High
0003     | 0.82      | 0.90      | High
0004     | 0.48      | 0.77      | Full English
0026     | 0.71      | 0.75      | Medium
0027     | 0.73      | 0.81      | Low
0031     | 0.72      | 0.84      | Medium
0037     | 0.68      | 0.80      | High
0050     | 0.78      | 0.83      | Medium
0058     | 0.84      | 0.87      | Medium
```

**Observation**: MMS is more stable across code-mixing levels. CTC drops on code-mixed content. Neither alone is sufficient.

### Recommended: Weighted score with disagreement penalty

```python
def score_segment(native_ctc: float, roman_mms: float) -> tuple[float, str]:
    """Weighted dual-validator score with disagreement penalty."""
    N, R = native_ctc, roman_mms
    S = 0.45 * N + 0.55 * R - 0.10 * abs(N - R)

    if S >= 0.70 and min(N, R) >= 0.40:
        # High disagreement: flag for review even when the score is good.
        verdict = "review" if abs(N - R) > 0.25 else "accept"
    elif S >= 0.55:
        verdict = "retry"    # borderline, worth retrying with the pro model
    else:
        verdict = "reject"
    return S, verdict
```

**Why this formula:**

- **Weighted, not max/min/avg**: `max` is too lenient (ignores when one validator flags badly). `min` is too strict (kills code-mixed segments). `avg` is naive. Weighted lets us express that MMS is more reliable for Indic code-mixed audio.
- **MMS weighted higher (0.55 vs 0.45)**: MMS is language-agnostic and stable across all code-mixing levels. CTC drops sharply on English-heavy segments (seg 0004: CTC=0.48 vs MMS=0.77). For Indic podcasts with 20-40% code-mixing, MMS is the more trustworthy signal.
- **Disagreement penalty (-0.10 * abs(N-R))**: When validators diverge strongly, something unusual is happening (heavy code-mixing, boundary artifact, bad romanization). The penalty flags these for closer inspection.
- **Floor check (min >= 0.40, not 0.55)**: CTC legitimately scores ~0.48 on correct English segments. A 0.55 floor would reject valid code-mixed transcriptions. 0.40 catches genuine failures without falsely rejecting valid code-mixed output.
- **Forced review on high disagreement (abs > 0.25)**: Even if the weighted score passes, large validator divergence is suspicious enough to flag.

**On our 10 test segments**: This formula accepts 8/10 (correct transcriptions), retries 2/10 (boundary artifact + pure English segment). The pure max approach would have accepted all 10, which is too lenient — it misses the boundary artifact case where a second look is warranted.

**Important caveat**: These weights are a hypothesis. They MUST be calibrated against a golden test set before production use. A deterministic single-source system can deterministically produce high-confidence mistakes at scale. The formula prevents obvious failures, but only human-audited ground truth prevents systematic ones.

### Retry chain

```
Attempt 1: gemini-3-flash-preview, thinking=low, temp=0
    → if S < 0.70:
Attempt 2: gemini-3-pro (or gemini-2.5-pro), thinking=medium, temp=0
    → if S < 0.70 after retry:
        mark as "rejected" / human review queue
```
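
A sketch of how this chain could be wired behind `validation_action="retry"`; `transcribe_fn` and `score_fn` stand in for the pipeline's real transcription and dual-validation calls (all names hypothetical):

```python
# Model/parameter chain from the retry plan above.
RETRY_CHAIN = [
    {"model": "gemini-3-flash-preview", "thinking": "low"},
    {"model": "gemini-2.5-pro", "thinking": "medium"},
]
ACCEPT_THRESHOLD = 0.70

def transcribe_with_retry(segment, transcribe_fn, score_fn):
    text, score = None, 0.0
    for attempt in RETRY_CHAIN:
        text = transcribe_fn(segment, model=attempt["model"],
                             thinking=attempt["thinking"], temperature=0)
        score = score_fn(segment, text)
        if score >= ACCEPT_THRESHOLD:
            return text, score, "accepted"
    # Chain exhausted: reject and route to the human review queue.
    return text, score, "rejected"
```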

---

## Corner Cases for Indic Languages at Scale

**Script / Character Level:**

1. **Script-ambiguous characters**: Telugu ద vs ధ (da vs dha), Hindi ड vs ड़ (ḍa vs ṛa). If Gemini picks the wrong one, CTC may or may not catch it depending on phoneme granularity.
2. **Zero-width characters**: Malayalam chillu letters, Kannada ZWJ sequences. Invisible in text but affect rendering and tokenization. Could silently corrupt TTS training.
3. **Script-specific confusions**: Nukta/Chandrabindu/chillu/Assamese variants — the language pack tips help but can't be validated automatically.

**Word / Phrase Level:**

4. **Sandhi / agglutination**: Telugu "నాకు ఇది" → "నాకిది". Speaker says joined form, Gemini might write separated form. Hard to verify automatically.
5. **Loan word transliteration variance**: "computer" → కంప్యూటర్ / కంపూటర్ / కంప్యూటరు. Inconsistent transliteration across segments creates noise for TTS.
6. **Romanization spelling variance**: Same pronunciation can be romanized differently across segments ("gurtuntaayi" vs "gurtuntayi" vs "gurtuntai"). Not caught by alignment, creates inconsistency.
7. **Numbers, dates, currency, abbreviations**: "Rs. 500" / "five hundred rupees" / "ఐదు వందలు" — multiple valid representations. No normalization standard.
8. **Brand names / proper nouns in code-switch**: "Parle-G", "SKU", "YouTube" — CTC models have no phoneme mapping for these, alignment scores are unreliable on them.
9. **Phonetically plausible but semantically wrong words**: CTC aligns phonemes, not meaning. If Gemini writes a word that sounds similar but means something different, CTC scores it as correct. This is a fundamental blind spot of phonetic validation.

**Audio / Segment Level:**

10. **Dialectal normalization**: "వస్తుంది" (formal) vs "వస్తది" (Telangana dialect). Gemini might normalize to formal even with the "no correction" rule.
11. **Heavily code-mixed segments (>50% English)**: Native CTC is useless. Pipeline labels it as "Telugu" but speaker is 80% English.
12. **Whispered / emotional speech**: CTC models trained on read/clean speech. Emotional, whispered, or shouted speech gets systematically lower scores even when transcription is correct.
13. **Music/jingle segments**: VAD false positives. Without PANN, these produce garbage data. [NO_SPEECH] token helps but depends on Gemini detecting it.
14. **Overlapped speakers / crosstalk**: Diarization imperfections leave cross-talk in segments (as we found with the "yeah yeah" case). Audio quality is degraded for TTS.
15. **VAD boundary cuts with overlap=0**: Segments split at 10s hard limit with no overlap. Sentences can be cut mid-word at split points.

**Validation Blind Spot:**

16. **CTC tokenization silently skips OOV characters** (`ctc_forced_aligner.py:193`): Characters not in the wav2vec2 vocabulary are dropped during tokenization. This means CTC alignment literally cannot see certain characters, hiding errors on them. The alignment score looks fine because the OOV chars were never tested.
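
A minimal mitigation: measure the OOV ratio up front, then warn and penalize instead of dropping characters silently. Both helpers below are hypothetical, not current aligner behavior:

```python
def oov_ratio(text: str, vocab: set) -> float:
    """Fraction of non-space characters missing from the wav2vec2 vocab.
    High values mean the alignment score silently ignored part of the text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if c not in vocab) / len(chars)

def penalized_score(raw_score: float, ratio: float, max_ok: float = 0.05) -> float:
    """Hypothetical penalty: leave low-OOV scores alone, scale down the rest."""
    return raw_score if ratio <= max_ok else raw_score * (1.0 - ratio)
```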

---

## Is this sufficient for large-scale TTS/ASR training?

**Not yet, but fixably close.**

The transcription engine (Gemini + prompt) is solid — probably 8/10 quality for Indic languages, which is better than most alternatives (Whisper, IndicConformer, crowd-sourcing). The gap is in validation and quality assurance, not in transcription.

**What large-scale TTS/ASR datasets actually need:**

- **Consistent quality** over quantity. 500 hours of clean, verified data beats 5000 hours of noisy unverified data. Your 100% deterministic output helps here.
- **Text-audio alignment** that's tight enough for forced alignment during training. Your dual CTC+MMS validation concept is right.
- **No poison data** — a single segment with hallucinated text that doesn't match audio can degrade model training disproportionately. This is where the "no ground truth" gap hurts most.

**To get to 8/10 (production-viable):**

1. Integrate romanized MMS alignment into the pipeline (not ad-hoc)
2. Implement the weighted scoring formula with disagreement penalty
3. Build retry chain: flash → pro → reject
4. Add CTC models for at least top 6 languages
5. Fix CTC aligner silently dropping OOV chars (`ctc_forced_aligner.py:193`) — at minimum log warnings and penalize alignment score when OOV ratio is high
6. Create a golden test set (50 segments per language, manually verified) to calibrate scoring formula weights
7. Add pre-transcription audio quality gate (SNR threshold at minimum)

**To get to 9+/10 (industry-grade):**

1. PANN integration for music/noise pre-filtering
2. Code-mixing ratio computation + language-appropriate handling
3. 5% random human audit on accepted segments
4. Production monitoring: score distributions per language over time (drift detection)
5. Batch API for cost efficiency at scale
6. Deduplication across segments

