---
name: System Rating and Gaps
overview: Honest assessment of the transcription + validation system for large-scale Indic TTS/ASR training, including scoring strategy recommendation and identification of critical gaps.
todos:
  - id: integrate-mms
    content: Integrate romanized MMS alignment into pipeline.py alongside existing CTC validation
    status: pending
  - id: tiered-scoring
    content: "Implement tiered scoring logic: max(ctc, mms) as primary, min as secondary check, with accept/review/retry/reject tiers"
    status: pending
  - id: retry-chain
    content: "Implement retry logic: flash -> pro -> reject. Wire into pipeline validation_action='retry'"
    status: pending
  - id: ctc-models
    content: Add CTC model definitions for remaining languages (Gujarati, Marathi, Punjabi, Odia, Assamese, English). Fix Bengali model (currently points to Hindi)
    status: pending
  - id: char-validation
    content: Add Unicode range definitions for missing languages in simple_validator.py (Gujarati, Marathi, Punjabi, Odia, Assamese)
    status: pending
  - id: audio-quality-gate
    content: "Add pre-transcription SNR gate: skip segments below threshold to avoid wasting API credits on noise"
    status: pending
  - id: golden-set
    content: "Create golden test set: 30-50 manually verified segments per priority language for threshold calibration"
    status: pending
isProject: false
---

# Transcription System Assessment: Honest Rating and Critical Gaps

## Rating: 6.5 / 10

Not bad. Not production-ready. The architecture is sound, the prompt engineering is genuinely good, but the validation infrastructure has significant holes that would bite at scale. Here's the breakdown.

---

## What earns the 6.5

**Transcription quality (8/10 on its own):**
- 100% deterministic across 3 runs on 10 segments. This is rare.
- v3 prompt fixed real bugs (translation vs transliteration), has proper anti-hallucination guards.
- 12-language template with per-language script tips. Maintainable.
- Gemini 3 Flash is genuinely strong on Indic audio. Good model choice.

**Architecture (7/10):**
- End-to-end pipeline wired: R2 download → Supabase lookup → audio processing → polishing → transcription → validation → save.
- Audio polisher handles boundary artifacts (proven useful).
- Dual validation concept (native CTC + romanized MMS) is the right idea.

**What's actually proven:** Tested on 10 Telugu segments, 3 runs each. That's it. Everything else rests on untested assumptions.

---

## What prevents a higher rating

### Critical Gap 1: Validation only works for Telugu (Impact: HIGH)

- **CTC models**: Only Telugu (`wav2vec2-large-xlsr-53-telugu`) is cached. 6 more are defined in code but never downloaded. Bengali's model is wrong (points to Hindi). 5 languages (Gujarati, Marathi, Punjabi, Odia, Assamese) have NO CTC model defined at all.
- **Character validation**: Only 6 of 12 languages have Unicode range definitions in `simple_validator.py`.
- **Romanized MMS**: Not integrated into the pipeline. We ran it ad-hoc in the consistency test but `pipeline.py` doesn't call it.
- **Reality**: If you run this on Hindi tomorrow, transcription works but validation is blind.
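
Filling the character-validation half of this gap is cheap. A minimal sketch of per-language Unicode range checks — the dict layout and function name are illustrative, not `simple_validator.py`'s actual API; the ranges themselves are the standard Unicode blocks:

```python
# Per-language native-script blocks (standard Unicode block boundaries).
# Layout and function name are hypothetical, not the existing module API.
SCRIPT_RANGES = {
    "telugu":   (0x0C00, 0x0C7F),
    "hindi":    (0x0900, 0x097F),  # Devanagari (also covers Marathi)
    "bengali":  (0x0980, 0x09FF),  # also covers Assamese
    "gujarati": (0x0A80, 0x0AFF),
    "punjabi":  (0x0A00, 0x0A7F),  # Gurmukhi
    "odia":     (0x0B00, 0x0B7F),
}

def native_script_ratio(text: str, language: str) -> float:
    """Fraction of alphabetic characters in the language's native block."""
    lo, hi = SCRIPT_RANGES[language]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    native = sum(1 for c in letters if lo <= ord(c) <= hi)
    return native / len(letters)
```

A low ratio on a supposedly native-script transcription is a cheap red flag even before any acoustic validation runs.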

### Critical Gap 2: No ground truth (Impact: HIGH)

CTC alignment scores are phonetic alignment quality, NOT accuracy metrics. A perfectly aligned hallucination scores high. Without ground truth:
- You cannot measure actual WER/CER.
- You cannot calibrate thresholds. The current 0.50 "reject" threshold was not derived from data — it's a guess.
- You don't know if 0.72 avg CTC actually means "good enough for TTS training."

**Minimum viable**: 30-50 manually verified segments per language (at least for the top 3-4 priority languages).
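
Once the golden set exists, CER needs nothing heavier than an edit distance. A self-contained sketch (plain Levenshtein; a library like `jiwer` would compute the same thing):

```python
# Character error rate against golden-set references. No dependencies.
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)
```

Plotting CER against CTC/MMS scores on the golden set is exactly what turns the 0.50 and 0.70 guesses into calibrated thresholds.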

### Critical Gap 3: Retry logic is dead code (Impact: MEDIUM)

`PipelineConfig.validation_action = "retry"` exists as a config field. Zero implementation behind it. No model fallback chain (flash → pro). No re-queue for rejected segments.

### Critical Gap 4: No audio quality gating (Impact: MEDIUM)

- SNR is measured by the polisher but **never used for rejection**. Every segment goes to Gemini regardless of audio quality — wasting API credits on noise/music segments.
- PANN (music/noise detection) is in `venv` dependencies but **not integrated anywhere in the pipeline**.
- No pre-transcription filter. Segments that are pure noise, music, or heavily degraded audio still get transcribed and validated, producing garbage data.
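
A minimal gate sketch, assuming the polisher's SNR measurement lands in segment metadata — the field name and the threshold are placeholders to be calibrated against the golden set, not values from the existing pipeline:

```python
# Hypothetical pre-transcription gate on the SNR the polisher already
# measures. SNR_MIN_DB is a placeholder, not a calibrated value.
SNR_MIN_DB = 10.0

def should_transcribe(segment_meta: dict) -> bool:
    """Skip segments whose measured SNR falls below the gate."""
    snr = segment_meta.get("snr_db")
    if snr is None:
        return True  # no measurement: fail open, let Gemini see it
    return snr >= SNR_MIN_DB
```

Even this crude gate stops pure-noise segments from burning an API call plus two validation passes each.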

### Critical Gap 5: Code-mixing scoring is unaddressed (Impact: MEDIUM)

Our test data proved that native CTC scores poorly on code-mixed segments (segment 0004: CTC=0.48 vs MMS=0.77). With no code-mix-aware scoring, every code-mixed segment in a large run would either:
- Get rejected (wasting good data), or
- Get flagged for manual review (doesn't scale)

Indic podcast audio is 20-40% code-mixed. This is not an edge case.
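
The test data itself suggests a usable proxy: when romanized MMS strongly outperforms native CTC, the segment is likely code-mixed (segment 0004: CTC=0.48 vs MMS=0.77). A sketch of flagging on that gap — the 0.15 cutoff is a guess to calibrate, not a measured value:

```python
# Code-mix proxy from the validators' disagreement. The gap threshold
# is an assumption to calibrate on the golden set.
def likely_code_mixed(ctc_score: float, mms_score: float,
                      gap: float = 0.15) -> bool:
    """Flag segments where the language-agnostic validator strongly
    outperforms the language-specific one."""
    return (mms_score - ctc_score) >= gap
```

Flagged segments could then be routed through code-mix-aware thresholds instead of being rejected or dumped on human reviewers.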

---

## Scoring Strategy Recommendation

### What each score actually measures

- **Native CTC** (language-specific wav2vec2): Does the native script phonetically match the audio? Good at native words, bad at English words transliterated into native script.
- **Romanized MMS** (language-agnostic torchaudio): Does the Latin transliteration phonetically match the audio? Good at everything, but depends on Gemini's romanization quality.

### From our test data

```
Segment  | Native CTC | Roman MMS | Code-mix
---------|-----------|-----------|----------
0000     | 0.72      | 0.62      | Medium
0001     | 0.70      | 0.80      | High
0003     | 0.82      | 0.90      | High
0004     | 0.48      | 0.77      | Full English
0026     | 0.71      | 0.75      | Medium
0027     | 0.73      | 0.81      | Low
0031     | 0.72      | 0.84      | Medium
0037     | 0.68      | 0.80      | High
0050     | 0.78      | 0.83      | Medium
0058     | 0.84      | 0.87      | Medium
```

**Observation**: MMS is more stable across code-mixing levels. CTC drops on code-mixed content. Neither alone is sufficient.

### Recommended: Tiered scoring with `max` as primary

```python
primary = max(ctc_score, mms_score)   # best validator wins
secondary = min(ctc_score, mms_score) # weakest validator

if primary >= 0.70 and secondary >= 0.40:
    verdict = "accept"     # at least one strong, other not catastrophic
elif primary >= 0.70:
    verdict = "review"     # one strong but other is terrible (suspicious)
elif primary >= 0.55:
    verdict = "retry"      # marginal, worth retrying with pro model
else:
    verdict = "reject"     # both validators say no
```

**Why `max` as primary, not `avg` or `weighted`:**

- `avg` punishes code-mixed segments unfairly (CTC drags down good transcriptions)
- Weighted-by-code-mix adds complexity and requires tuning weights per language
- `max` is robust: if EITHER validator says "this matches the audio", trust it. The secondary check (`min >= 0.40`) catches the case where one validator is coincidentally high on garbage.
- From our data: `max` would accept all 10 segments (correct — they're all good transcriptions). `avg` would flag segment 0004 (incorrect — the transcription IS right, CTC just can't handle English).

**Why not `min`**: Too strict. Would reject valid code-mixed segments where CTC inevitably scores low.

### Retry chain

```
Attempt 1: gemini-3-flash-preview, thinking=low, temp=0
    → if 0.55 <= max_score < 0.70:
Attempt 2: gemini-3-pro (or gemini-2.5-pro), thinking=medium, temp=0
    → if max_score still < 0.70:
        mark as "rejected" / human review queue
```

The retry band is 0.55-0.70: segments in that range are borderline and might improve with a better model. Below 0.55, both validators say "no" and a different model is unlikely to rescue the segment — reject it outright. Above 0.70, it's already good enough.
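
The chain can be sketched in a few lines, wired to the tiered thresholds. The model IDs and the `transcribe()`/`validate()` signatures are assumptions about the pipeline's API, not its current code:

```python
# Sketch of the flash -> pro -> reject chain using the tiered thresholds.
# transcribe() and validate() are hypothetical pipeline helpers.
RETRY_LOW, ACCEPT = 0.55, 0.70

def transcribe_with_retry(segment, transcribe, validate):
    for model in ("gemini-3-flash-preview", "gemini-2.5-pro"):
        text = transcribe(segment, model=model, temperature=0)
        score = max(validate(segment, text))  # (ctc, mms) pair
        if score >= ACCEPT:
            return text, score, "accept"
        if score < RETRY_LOW:
            break  # both validators say no; a better model rarely saves it
    return text, score, "reject"
```

Rejected segments keep their last text and score attached, so the human review queue sees what the best attempt produced.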

---

## Corner Cases for Indic Languages at Scale

1. **Script-ambiguous characters**: Telugu ద vs ధ (da vs dha), Hindi ड vs ड़ (ḍa vs ṛa). If Gemini picks the wrong one, CTC may or may not catch it depending on the wav2vec2 model's phoneme granularity.

2. **Sandhi / agglutination**: Telugu "నాకు ఇది" → "నాకిది". Speaker says joined form, Gemini might write separated form. Both are "correct" but alignment scores would differ. The prompt says "preserve as spoken" but this is hard to verify automatically.

3. **Dialectal normalization**: "వస్తుంది" (formal) vs "వస్తది" (Telangana dialect). Gemini might normalize to formal even with our "no correction" rule. CTC can't tell which is right — both align similarly.

4. **Loan word transliteration variance**: "computer" → కంప్యూటర్ / కంపూటర్ / కంప్యూటరు. Only one matches the specific speaker's pronunciation. Across a large dataset, inconsistent transliteration creates noise.

5. **Zero-width characters**: Malayalam chillu letters, Kannada ZWJ sequences. Invisible in text but affect rendering and tokenization. Could silently corrupt TTS training.

6. **Heavily code-mixed segments (>50% English)**: The entire segment might as well be English. Native CTC is useless. If the pipeline language is "Telugu" but the speaker is 80% English, should this even be in the Telugu dataset?

7. **Whispered / emotional speech**: The wav2vec2 CTC models were trained on read/clean speech. Emotional, whispered, or shouted speech will have systematically lower alignment scores — not because the transcription is wrong, but because the CTC model wasn't trained on that speaking style.

8. **Music/jingle segments**: VAD sometimes passes short music segments (podcast intros). Without PANN, these go through the entire pipeline and produce garbage. The [NO_SPEECH] prompt token helps but relies on Gemini catching it.
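
Corner case 5 is at least cheap to screen for. A sketch that flags — not strips, since ZWJ/ZWNJ are linguistically meaningful in Malayalam chillu and Kannada sequences — invisible characters for downstream inspection; the character set is a starting point, not exhaustive:

```python
# Flag invisible format characters for inspection. The set is illustrative;
# do NOT blindly strip ZWJ/ZWNJ, which are meaningful in some Indic scripts.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}  # ZWSP, ZWNJ, ZWJ, BOM

def find_zero_width(text: str) -> list[tuple[int, str]]:
    """Return (index, codepoint) pairs for invisible characters."""
    return [(i, f"U+{ord(c):04X}")
            for i, c in enumerate(text) if c in ZERO_WIDTH]
```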

---

## Is this sufficient for large-scale TTS/ASR training?

**Not yet, but fixably close.**

The transcription engine (Gemini + prompt) is solid — probably 8/10 quality for Indic languages, which is better than most alternatives (Whisper, IndicConformer, crowd-sourcing). The gap is in validation and quality assurance, not in transcription.

**What large-scale TTS/ASR datasets actually need:**

- **Consistent quality** over quantity. 500 hours of clean, verified data beats 5000 hours of noisy unverified data. Your 100% deterministic output helps here.
- **Text-audio alignment** that's tight enough for forced alignment during training. Your dual CTC+MMS validation concept is right.
- **No poison data** — a single segment with hallucinated text that doesn't match audio can degrade model training disproportionately. This is where the "no ground truth" gap hurts most.

**To get to 8/10 (production-viable):**

1. Integrate romanized MMS alignment into the pipeline (not ad-hoc)
2. Implement the tiered scoring logic (max-based, with retry)
3. Build retry chain: flash → pro → reject
4. Add CTC models for at least top 6 languages
5. Create a golden test set (50 segments per language, manually verified)
6. Add pre-transcription audio quality gate (SNR threshold at minimum)

**To get to 9+/10 (industry-grade):**

7. PANN integration for music/noise pre-filtering
8. Code-mixing ratio computation + language-appropriate handling
9. 5% random human audit on accepted segments
10. Production monitoring: score distributions per language over time (drift detection)
11. Batch API for cost efficiency at scale
12. Deduplication across segments
