---
name: Post-Transcription Quality Pipeline
overview: Quality analysis, filtering, and data export pipeline to turn 507K transcribed videos (~73M segments) into training-ready datasets for both TTS and ASR across 12 Indic languages.
todos:
  - id: export-local
    content: "Step 0: Export transcription_results from Supabase to local Parquet (~22 GB, ~30-60 min over network)"
    status: pending
  - id: quality-deep-dive
    content: "Phase 1: Quality deep-dive on local Parquet (lang mismatch pairs, TTS yield recalc, boundary/script/per-language analysis)"
    status: pending
  - id: retiering
    content: "Phase 2: Re-tier data locally (A/B/C/D), write back tier column to Supabase or save as new Parquet"
    status: pending
isProject: false
---

# Post-Transcription: Quality Analysis, Filtering, and Training Data Pipeline

## Where We Are

```
507,387 videos transcribed | ~73M segments | 12 languages
99% quality >= 0.9 | 99% ASR eligible | 20% TTS eligible | 79% lang mismatch
```

The 20% TTS yield and 79% lang mismatch are the key issues to investigate and address.

## The Lang Mismatch Problem

The TTS eligibility check in [src/validator.py](src/validator.py) (lines 180-186) requires `not result.lang_mismatch`. The mismatch flag fires when Gemini's `detected_language` differs from the video queue's `expected_language`. This is overly strict because:

- **Code-mixed speech is the norm** for Indic languages (Hindi+English, Tamil+English, etc.)
- A "Hindi" video where the speaker says one English sentence gets flagged as mismatch
- The original language label comes from YouTube metadata, which describes the video, not each segment
- For TTS, code-mixed segments are perfectly valid training data -- you just need accurate per-segment language labels

This means the real TTS-usable dataset is likely **much larger than 20%** once we relax the mismatch criterion intelligently.

## Step 0: Export to Local Parquet

Export the full `transcription_results` table (~73M rows) to local Parquet files for instant analysis. Supabase queries time out on full-table aggregates, so local is the way.

- **Method**: `psql` with `COPY ... TO STDOUT` in CSV format, streamed to a local file, then converted to Parquet via `pyarrow`
- **Columns**: All 42 columns including `transcription` and `tagged` text (needed for script analysis and later training)
- **Size**: ~22 GB as compressed Parquet, 529 GB disk free
- **Time**: ~30-60 minutes (network-bound, streaming 56 GB from Supabase PostgreSQL)
- **Output**: `/home/ubuntu/transcripts/data/transcription_results.parquet` (or sharded by language)

Alternatively, export in batches by language to parallelize and get usable data sooner.

## Phase 1: Quality Deep-Dive (local Parquet, instant queries)

All analysis runs locally on the Parquet file using DuckDB or pandas -- no more Supabase timeouts.

- **Lang mismatch breakdown**: For each `(expected_language_hint, detected_language)` pair, count segments. Identify code-mixing patterns vs actual misclassification.
- **Script validation cross-check**: For segments with lang_mismatch, check if the transcription text actually contains the expected script characters. If yes, it's code-mixed (valid), not mislabeled.
- **TTS yield recalculation**: Re-compute TTS eligibility ignoring the lang_mismatch gate. Expected to jump from 20% to 60-80%.
- **Boundary score distribution**: How many segments have abrupt starts/ends?
- **Per-language quality breakdown**: Quality scores, ASR/TTS eligibility rates, UNK density by language.
- **Text length distribution**: chars_per_second distribution, identify outliers.
- **Split segment analysis**: How many segments are splits, distribution of split counts per original.
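In miniature, the mismatch-pair breakdown and the TTS-yield recalculation look like this (pure-Python sketch; a real run would use DuckDB over the Parquet file, and the eligibility predicate here is a deliberately simplified stand-in for the full checks in `src/validator.py` — field names are assumptions about the local schema):

```python
from collections import Counter

def mismatch_pairs(rows: list[dict]) -> Counter:
    """Segment counts per (expected, detected) language pair."""
    return Counter(
        (r["expected_language_hint"], r["detected_language"]) for r in rows
    )

def tts_yield_without_mismatch_gate(rows: list[dict],
                                    min_quality: float = 0.9) -> float:
    """TTS eligibility rate with the lang_mismatch gate dropped.
    NOTE: simplified predicate -- quality threshold + no UNK tokens only."""
    if not rows:
        return 0.0
    ok = sum(1 for r in rows if r["quality"] >= min_quality and not r["has_unk"])
    return ok / len(rows)
```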

## Phase 2: Re-tier the Data

Based on Phase 1 findings, define new quality tiers:

- **Tier A (TTS-pristine)**: quality >= 0.9, boundary >= 0.9, same-script confirmed, no UNK/inaudible, single speaker. For clean single-speaker TTS.
- **Tier B (TTS-expressive)**: quality >= 0.8, allows code-mixing (detected_lang in [expected, "en"]), allows up to 1 event tag. For expressive/code-mixed TTS.
- **Tier C (ASR-grade)**: quality >= 0.5, any language mix. For ASR training where messier data is fine.
- **Tier D (flagged)**: quality < 0.5 or empty. Skip.

The tiering can be done in Supabase (`ALTER TABLE` to add a `tier` column, then a SQL `UPDATE` to populate it) or computed at export time.
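The tier rule can be sketched as a pure function (thresholds from the list above; field names like `boundary`, `same_script`, `n_speakers`, and `n_event_tags` are assumptions about the local schema, not confirmed column names):

```python
def assign_tier(seg: dict) -> str:
    """Map a segment row to a quality tier per the A/B/C/D rules above."""
    q = seg.get("quality", 0.0)
    # Tier D: too low quality or empty transcription -- skip.
    if q < 0.5 or not seg.get("transcription"):
        return "D"
    # Tier A: pristine single-speaker TTS material.
    if (q >= 0.9 and seg.get("boundary", 0.0) >= 0.9
            and seg.get("same_script") and not seg.get("has_unk")
            and seg.get("n_speakers", 1) == 1):
        return "A"
    # Tier B: expressive/code-mixed TTS (English mixing allowed).
    if (q >= 0.8
            and seg.get("detected_language") in (seg.get("expected_language_hint"), "en")
            and seg.get("n_event_tags", 0) <= 1):
        return "B"
    # Tier C: anything else above the quality floor is ASR-grade.
    return "C"
```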

## Phase 3: Data Export Pipeline

Export from Supabase + R2 into training-ready format. Two main options:

**Option A: HuggingFace Dataset format** (recommended for sharing/reproducibility)

- Parquet files with columns: audio_path, text, language, speaker_id, duration_s, quality_tier, emotion, pace
- Audio files organized by language/video_id
- Per-language train/val/test splits (e.g., 95/2.5/2.5)
- Push to HuggingFace Hub for easy loading

**Option B: WebDataset format** (recommended for large-scale training)

- `.tar` shards with paired `.flac` + `.json` per segment
- Streaming-friendly, no need to download entire dataset
- Works well with PyTorch DataLoader

For either option, the export pipeline:

1. Query Supabase for all segments in target tier(s), grouped by language
2. For each video, download the `_transcribed.tar` from R2 (already contains polished audio + per-segment JSONs)
3. Re-package into the target format with proper splits
4. For long-context: recombine splits by grouping on original filename, concatenate audio + text in order

This is another distributed workload (507K tars to download and repackage), but much lighter than transcription -- pure I/O, no API calls.

## Phase 4: Training Setup

Once data is exported:

- **ASR**: Fine-tune Whisper or IndicConformer on Tier B+C data (~60-70M segments). Standard CTC/attention training.
- **TTS**: Fine-tune VITS/XTTS/StyleTTS2 on Tier A+B data per language. Requires speaker clustering first (group segments by speaker embedding similarity for multi-speaker models).
- **Long-context TTS**: Use recombined segments (original pre-split audio from R2 input bucket + concatenated transcriptions) for models that support >15s context.
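The grouping step for long-context recombination can be sketched as follows, assuming split segments are named `<original>_splitNN` (the real naming scheme may differ -- adjust the pattern accordingly):

```python
import re
from collections import defaultdict

# Assumed naming convention for split segments; hypothetical pattern.
SPLIT_RE = re.compile(r"^(?P<orig>.+)_split(?P<idx>\d+)$")

def group_splits(segment_ids: list[str]) -> dict[str, list[str]]:
    """Group split-segment ids by original id, ordered by split index,
    so audio and text can be concatenated in the right order."""
    groups: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for sid in segment_ids:
        m = SPLIT_RE.match(sid)
        if m:
            groups[m.group("orig")].append((int(m.group("idx")), sid))
        else:
            groups[sid].append((0, sid))  # unsplit segment: its own group
    return {orig: [sid for _, sid in sorted(parts)]
            for orig, parts in groups.items()}
```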

## Recommended Execution Order

Start with Phase 1 (pure analysis, no code changes, a few hours of SQL work) to understand the real data quality before committing to export format or training approach. The lang mismatch investigation alone could change the TTS yield from 20% to 60-80%.