---
name: Post-Transcription Quality Pipeline
overview: Quality analysis, filtering, and data export pipeline to turn 507K transcribed videos (~73M segments) into training-ready datasets for both TTS and ASR across 12 Indic languages.
todos:
  - id: quality-deep-dive
    content: "Phase 1: Run quality deep-dive SQL analysis (lang mismatch breakdown by language pair, TTS yield without mismatch gate, boundary score distribution, per-language quality)"
    status: pending
  - id: retiering
    content: "Phase 2: Define and apply new quality tiers (A/B/C/D) based on Phase 1 findings, add tier column or compute at export"
    status: pending
  - id: export-pipeline
    content: "Phase 3: Build data export pipeline (Supabase + R2 -> HuggingFace/WebDataset format, per-language splits, long-context recombination)"
    status: pending
  - id: training-setup
    content: "Phase 4: Training setup (ASR: Whisper/IndicConformer fine-tune, TTS: VITS/XTTS per-language, speaker clustering)"
    status: pending
isProject: false
---

# Post-Transcription: Quality Analysis, Filtering, and Training Data Pipeline

## Where We Are

```
507,387 videos transcribed | ~73M segments | 12 languages
99% quality >= 0.9 | 99% ASR eligible | 20% TTS eligible | 79% lang mismatch
```

The 20% TTS yield and 79% lang mismatch are the key issues to investigate and address.

## The Lang Mismatch Problem

The TTS eligibility check in [src/validator.py](src/validator.py) (lines 180-186) requires `not result.lang_mismatch`. The mismatch flag fires whenever Gemini's `detected_language` differs from the video queue's `expected_language`. This is overly strict because:

- **Code-mixed speech is the norm** for Indic languages (Hindi+English, Tamil+English, etc.)
- A "Hindi" video where the speaker says one English sentence gets flagged as mismatch
- The original language label comes from YouTube metadata, which describes the video, not each segment
- For TTS, code-mixed segments are perfectly valid training data -- you just need accurate per-segment language labels

This means the real TTS-usable dataset is likely **much larger than 20%** once we relax the mismatch criterion intelligently.
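A relaxed eligibility check could combine the existing script validation with the detected language. A minimal sketch, where the function names, field names, and Unicode ranges are illustrative assumptions, not the validator's actual logic:

```python
import re

# Unicode ranges for a few Indic scripts (illustrative subset, not the
# validator's actual script table)
SCRIPT_RANGES = {
    "hi": r"[\u0900-\u097F]",  # Devanagari
    "ta": r"[\u0B80-\u0BFF]",  # Tamil
    "bn": r"[\u0980-\u09FF]",  # Bengali
}

def is_code_mixed(text: str, expected_lang: str, detected_lang: str) -> bool:
    """A 'mismatch' where the expected script is still present is almost
    certainly code-mixing, not a mislabeled video."""
    pattern = SCRIPT_RANGES.get(expected_lang)
    if pattern is None:
        return False
    return detected_lang == "en" and re.search(pattern, text) is not None

def tts_eligible_relaxed(text: str, expected_lang: str, detected_lang: str) -> bool:
    # Accept exact matches, plus English-detected segments that retain
    # the expected script (Hindi+English, Tamil+English, ...).
    return detected_lang == expected_lang or is_code_mixed(
        text, expected_lang, detected_lang
    )
```

Segments that pass `is_code_mixed` would keep an accurate per-segment language label (e.g. `hi+en`) rather than being dropped.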

## Phase 1: Quality Deep-Dive (investigate before fixing)

Understand the actual data distribution before changing any filters:

- **Lang mismatch breakdown**: For each expected language, what does Gemini detect instead? (e.g., how much "hi" content is detected as "en"? Is it code-mixing or actual misclassification?)
- **Script validation cross-check**: The validator already checks if the text contains the expected script characters. Segments that have correct script but "wrong" detected language are almost certainly code-mixed, not mislabeled.
- **TTS eligibility without lang mismatch gate**: Re-compute TTS eligibility ignoring the lang_mismatch flag to see actual yield.
- **Boundary score distribution**: How many segments have abrupt starts/ends?
- **Per-language quality**: Are some languages significantly worse than others?

All of this is answerable with SQL queries against the existing `transcription_results` table -- no reprocessing needed. Given the 73M-row table size, we should either add indexes on the queried columns or export a representative sample for analysis.
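The mismatch breakdown is a single `GROUP BY`. A sketch assuming the `transcription_results` column names (`expected_language`, `detected_language`), runnable against any DB-API connection (Postgres via Supabase, or a local sample DB):

```python
# Which languages does Gemini detect when it disagrees with the expected
# label? Column names are assumptions about the pipeline's schema.
MISMATCH_BREAKDOWN = """
SELECT expected_language,
       detected_language,
       COUNT(*) AS n
FROM transcription_results
WHERE expected_language <> detected_language
GROUP BY expected_language, detected_language
ORDER BY n DESC
"""

def mismatch_breakdown(conn):
    """Return (expected, detected, count) rows, most common pairs first.
    `conn` is any connection exposing .execute() (sqlite3 does; for
    psycopg, run the query through a cursor instead)."""
    return conn.execute(MISMATCH_BREAKDOWN).fetchall()
```

If most of the `hi -> en` volume also passes the script check, the mismatch is code-mixing and the gate can be relaxed.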

## Phase 2: Re-tier the Data

Based on Phase 1 findings, define new quality tiers:

- **Tier A (TTS-pristine)**: quality >= 0.9, boundary >= 0.9, same-script confirmed, no UNK/inaudible, single speaker. For clean single-speaker TTS.
- **Tier B (TTS-expressive)**: quality >= 0.8, allows code-mixing (detected_lang in [expected, "en"]), allows up to 1 event tag. For expressive/code-mixed TTS.
- **Tier C (ASR-grade)**: quality >= 0.5, any language mix. For ASR training where messier data is fine.
- **Tier D (flagged)**: quality < 0.5 or empty. Skip.

The tiering can be done as a SQL `UPDATE` that adds a `tier` column, or computed at export time.
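The tier rules above can be expressed as a pure function over a segment row. A sketch -- the field names (`quality_score`, `boundary_score`, `script_ok`, etc.) are assumptions about the schema, not confirmed column names:

```python
def assign_tier(seg: dict) -> str:
    """Map one segment row to a quality tier per the A/B/C/D rules.
    Field names are hypothetical stand-ins for the real schema."""
    q = seg.get("quality_score", 0.0)
    if q < 0.5 or not seg.get("text"):
        return "D"  # flagged: low quality or empty transcription
    if (q >= 0.9
            and seg.get("boundary_score", 0.0) >= 0.9
            and seg.get("script_ok")
            and not seg.get("has_unk")
            and seg.get("num_speakers", 1) == 1):
        return "A"  # TTS-pristine
    if (q >= 0.8
            and seg.get("detected_language") in (seg.get("expected_language"), "en")
            and seg.get("num_event_tags", 0) <= 1):
        return "B"  # TTS-expressive: code-mixing allowed
    return "C"      # ASR-grade
```

Keeping the rules in one function means the same logic can back either the SQL `UPDATE` (transliterated to a `CASE` expression) or the export-time path.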

## Phase 3: Data Export Pipeline

Export from Supabase + R2 into training-ready format. Two main options:

**Option A: HuggingFace Dataset format** (recommended for sharing/reproducibility)
- Parquet files with columns: audio_path, text, language, speaker_id, duration_s, quality_tier, emotion, pace
- Audio files organized by language/video_id
- Per-language train/val/test splits (e.g., 95/2.5/2.5)
- Push to HuggingFace Hub for easy loading

**Option B: WebDataset format** (recommended for large-scale training)
- `.tar` shards with paired `.flac` + `.json` per segment
- Streaming-friendly, no need to download entire dataset
- Works well with PyTorch DataLoader

For either option, the export pipeline:
1. Query Supabase for all segments in target tier(s), grouped by language
2. For each video, download the `_transcribed.tar` from R2 (already contains polished audio + per-segment JSONs)
3. Re-package into the target format with proper splits
4. For long-context: recombine splits by grouping on original filename, concatenate audio + text in order

This is another distributed workload (507K tars to download and repackage), but it is much lighter than transcription -- pure I/O, no API calls.
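Step 4's recombination reduces to a group-and-sort over segment metadata; audio concatenation follows the same ordering. A sketch where the `video_id`/`index`/`text` field names are assumptions:

```python
from collections import defaultdict

def recombine_long_context(segments):
    """Regroup per-segment rows into long-context text, keyed by the
    source video and ordered by segment index. Field names are
    hypothetical stand-ins for the per-segment JSON metadata."""
    by_video = defaultdict(list)
    for seg in segments:
        by_video[seg["video_id"]].append(seg)
    return {
        vid: " ".join(s["text"] for s in sorted(segs, key=lambda s: s["index"]))
        for vid, segs in by_video.items()
    }
```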

## Phase 4: Training Setup

Once data is exported:

- **ASR**: Fine-tune Whisper or IndicConformer on Tier B+C data (~60-70M segments). Standard CTC/attention training.
- **TTS**: Fine-tune VITS/XTTS/StyleTTS2 on Tier A+B data per language. Requires speaker clustering first (group segments by speaker embedding similarity for multi-speaker models).
- **Long-context TTS**: Use recombined segments (original pre-split audio from R2 input bucket + concatenated transcriptions) for models that support >15s context.
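Speaker clustering can start as greedy assignment by cosine similarity to cluster centroids -- a toy stand-in for proper diarization-style clustering (e.g. spectral clustering on x-vectors), not a production method:

```python
import numpy as np

def cluster_speakers(embeddings, threshold=0.75):
    """Assign each speaker embedding to the first cluster whose seed
    centroid is within `threshold` cosine similarity, else open a new
    cluster. Centroids are not updated after creation (greedy sketch)."""
    centroids, labels = [], []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        if centroids:
            sims = [float(emb @ c) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels.append(best)
                continue
        centroids.append(emb)
        labels.append(len(centroids) - 1)
    return labels
```

The resulting labels become `speaker_id` values in the export metadata for multi-speaker TTS training.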

## Recommended Execution Order

Start with Phase 1 (pure analysis, no code changes, a few hours of SQL work) to understand the real data quality before committing to export format or training approach. The lang mismatch investigation alone could change the TTS yield from 20% to 60-80%.
