# Building human-quality TTS for Indic languages at scale

Training a production-ready text-to-speech model for 11 Indic languages from 500,000 YouTube podcast segments requires a carefully orchestrated pipeline spanning transcription, validation, normalization, and emotion annotation. **The most effective approach combines Gemini models for transcription with AI4Bharat's IndicMFA for forced alignment validation, using a multi-representation transcript format that preserves both native scripts and normalized spoken forms.** This strategy, informed by how CosyVoice, FireRedTTS, and Inworld TTS-1 structure their training data, enables code-mixing support, emotion controllability, and robust quality filtering without ground truth.

The key insight from state-of-the-art multilingual TTS systems is that emotion and style capabilities should be introduced during supervised fine-tuning (SFT), not pre-training—while script handling and code-mixing data should be present from the earliest training stages. For your 500K-segment pipeline, expect to process approximately **4,200 hours of audio**, achievable in 2-3 days with proper parallelization.

## Gemini transcription requires strict verbatim prompting to prevent hallucination

Gemini models (2.5 Flash, 3 Flash, 3 Pro) can transcribe audio at **32 tokens per second**, supporting FLAC format natively with up to 9.5 hours per prompt. However, LLM transcription introduces three distinct hallucination modes: insertion errors (adding words not spoken), substitution errors (semantic "corrections"), and oscillation patterns (repetitive loops like "ay ay ay"). Even at temperature=0, these occur because models are trained for fluency over strict accuracy.

The anti-hallucination prompt strategy that works in production emphasizes legal/evidentiary framing:

```
You are a verbatim audio transcription system for Indic language podcasts.
Your output will be used for TTS/ASR training data - accuracy is critical.

RULES:
1. Transcribe EXACTLY what is spoken in {LANGUAGE}
2. Output in native script (not romanized)
3. DO NOT translate, correct, or embellish
4. Include all filler words, false starts, and repetitions
5. Mark unclear audio as [अस्पष्ट] or equivalent in target language
6. For code-switching (mixing languages), transcribe as spoken
7. NO creative additions - this is legal evidence-grade transcription
```

**Gemini 3 Flash** should serve as your primary transcriber (90-95% of Pro's accuracy at ~4x lower cost), with cascading to **Gemini 3 Pro** for segments flagged with confidence scores below 0.6 or repetition patterns detected. Set `thinking_level: "minimal"` to reduce over-reasoning that introduces changes.
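The oscillation mode is cheap to detect before escalating. A minimal sketch of such a check (the heuristic, thresholds, and function names are our own, not a Gemini API feature): flag a segment for the Pro cascade when confidence is low or the transcript is dominated by repeated n-grams.

```python
from collections import Counter

def repetition_score(text: str, ngram: int = 3) -> float:
    """Fraction of word n-grams that are duplicates; high values suggest
    an oscillation-style hallucination loop ("ay ay ay")."""
    tokens = text.split()
    if len(tokens) <= ngram:
        return 0.0
    grams = [tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]
    counts = Counter(grams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(grams)

def should_escalate(confidence: float, transcript: str,
                    conf_floor: float = 0.6, rep_ceiling: float = 0.3) -> bool:
    """Route a segment to the larger model when confidence is below the
    floor or the transcript shows a repetition pattern."""
    return confidence < conf_floor or repetition_score(transcript) > rep_ceiling
```

The n-gram window keeps legitimate repetitions (a speaker saying a word twice) below the ceiling while long synthetic loops score close to 1.0.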

For language coverage, Gemini officially supports **8 of your 11 target languages** (Telugu, Hindi, Kannada, Tamil, Malayalam, Bengali, Gujarati, and Marathi), plus the closely related Urdu. Punjabi, Odia, and Assamese lack official support; test actual performance there, and consider AI4Bharat's IndicWhisper, Sarvam's ASR models, or Google Cloud Speech-to-Text as fallbacks for these three.

## AI4Bharat's IndicMFA provides comprehensive forced alignment for all 11 languages

Forced alignment serves dual purposes: creating word/phoneme-level timestamps for TTS training and validating LLM transcription quality through alignment success rates. **IndicMFA emerges as the clear winner**, providing pretrained models for all 22 scheduled Indian languages with **43-313 hours of training data per language**.

| Language | IndicMFA Training Hours | Data Quality |
|----------|------------------------|--------------|
| Bengali | 313 hours | Excellent |
| Assamese | 303 hours | Excellent |
| Tamil | 300 hours | Excellent |
| Telugu | 262 hours | Excellent |
| Hindi | 255 hours | Excellent |
| Marathi | 213 hours | Very Good |
| Malayalam | 197 hours | Very Good |
| Kannada | 194 hours | Very Good |
| Odia | 132 hours | Good |
| Punjabi | 124 hours | Good |
| Gujarati | 43 hours | Adequate |

IndicMFA's key innovation is its **grapheme-to-grapheme (G2G) approach**, which maps each Unicode character directly to itself rather than requiring complex phoneme dictionaries. This eliminates the need for language-specific G2P rules while handling schwa deletion, gemination, and nasalization implicitly through the acoustic model.

For confidence scoring, track these alignment signals:
- **Alignment failure rate** per batch (>5% indicates transcription problems)
- **Word coverage ratio** (aligned words / expected words < 0.9 warrants review)
- **Timing plausibility** (speech rate outside 1-4 words/second is suspicious)
- **Gap ratio** (>30% silence suggests audio-text mismatch)
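All four signals fall out directly from the aligner's word intervals. A sketch using the thresholds above (function and field names are our own):

```python
def alignment_signals(words, expected_words, segment_dur):
    """Compute confidence signals from forced-alignment output.
    `words` is a list of (start_s, end_s) intervals for aligned words."""
    aligned = len(words)
    coverage = aligned / expected_words if expected_words else 0.0
    speech_rate = aligned / segment_dur if segment_dur else 0.0
    voiced = sum(end - start for start, end in words)
    gap_ratio = 1.0 - voiced / segment_dur if segment_dur else 1.0
    return {
        "word_coverage": coverage,       # < 0.9 warrants review
        "speech_rate_wps": speech_rate,  # outside 1-4 words/s is suspicious
        "gap_ratio": gap_ratio,          # > 0.3 suggests audio-text mismatch
    }

def passes_alignment_gate(sig):
    """Apply the per-segment thresholds; batch-level failure rate is
    tracked separately."""
    return (sig["word_coverage"] >= 0.9
            and 1.0 <= sig["speech_rate_wps"] <= 4.0
            and sig["gap_ratio"] <= 0.3)
```

Aggregating `passes_alignment_gate` over a batch gives the alignment failure rate from the first bullet.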

WhisperX can serve as a secondary aligner for cross-validation—it uses wav2vec 2.0 models and works with the **CLSRIL-23** checkpoint covering 23 Indic languages. However, benchmark comparisons show MFA achieves higher alignment accuracy at finer temporal resolution (~10ms vs WhisperX's coarser boundaries).

## Native scripts should be primary with romanization as secondary representation

The question of native script versus romanization for TTS training has a clear answer from production systems: **use native scripts as the primary representation** while storing romanized versions as secondary. CosyVoice handles this through language-agnostic BPE tokenization that processes any script, while AI4Bharat's Indic-TTS trains exclusively on native scripts across 21 languages.

Native script advantages include preserving phonetic distinctions (aspirated vs unaspirated consonants, gemination) that romanization loses, maintaining consistent orthography (avoiding "namasthe" vs "namaste" variations), and enabling direct phoneme mapping. The limitation—keyboard accessibility—matters for user input but not training data.

For code-mixed speech like "are bhaai, kya kar rhe ho?", research from the "Code-Mixed TTS under Low-Resource Constraints" paper demonstrates that **single-script transliteration works remarkably well**:

1. Transliterate ALL text to the dominant language's script (e.g., all Hinglish → Devanagari)
2. Train a single-script TTS model on the combined data
3. At inference, transliterate code-mixed input before synthesis

This approach enables bilingual TTS without requiring labeled code-mixed recordings. Use **AI4Bharat IndicXlit** for transliteration—an 11M parameter transformer achieving 15% improvement over prior state-of-the-art on the Dakshina benchmark.
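The routing half of step 1 can be sketched with the stdlib alone: use Unicode character names to find which tokens are Latin-script and need transliteration. The transliterator itself is passed in as a callable (in production, IndicXlit) rather than implemented here; function names are our own.

```python
import unicodedata

def token_script(token: str) -> str:
    """Best-effort script label for a token, via Unicode character names."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("DEVANAGARI"):
                return "devanagari"
            if name.startswith("LATIN"):
                return "latin"
    return "other"

def to_single_script(text, transliterate):
    """Route Latin-script tokens through `transliterate` so the whole
    utterance is rendered in the dominant script before synthesis."""
    out = []
    for tok in text.split():
        out.append(transliterate(tok) if token_script(tok) == "latin" else tok)
    return " ".join(out)
```

The same dispatch generalizes to other target scripts by swapping the Unicode block name.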

For mixed-script scenarios like Telugu with embedded Hindi (`అరే అలా కాదు, దీన్ని హిందీలో 'मैं सेब खाता हूँ' అంటారు`), preserve the original Unicode and apply language-specific normalizers from Indic NLP Library. CosyVoice and SeamlessM4T handle such inputs through unified multilingual tokenization without explicit language tags—the model learns to switch naturally.

## Emotion annotation vocabulary should start with 15 stable tags

Based on CosyVoice 3, Inworld TTS-1, and FireRedTTS implementations, a **15-tag vocabulary** balances expressiveness with annotation reliability:

**Emotional states (8):** `[happy]`, `[sad]`, `[angry]`, `[fearful]`, `[surprised]`, `[neutral]`, `[disgusted]`, `[confused]`

**Audio events (7):** `[laughter]`, `[breath]`, `[cough]`, `[sigh]`, `[cry]`, `[sneeze]`, `[yawn]`

Tag format conventions vary across systems—CosyVoice uses `[tag]`, Orpheus-TTS uses `<tag>`, OpenAudio S1 uses `(tag)`. For Indic TTS, the bracket notation `[tag]` is recommended for simplicity and compatibility.

For automatic emotion detection, **SenseVoice** (from FunAudioLLM) provides multi-task ASR + emotion recognition + audio event detection at 15x Whisper speed, though it currently supports Chinese, English, Japanese, Korean, and Cantonese—not Indic languages directly. For Hindi, models trained on the IITKGP-SEHSC corpus (4 classes: happy, sad, fear, anger) offer starting points. Set confidence thresholds at **0.5 probability** minimum, with human validation for 10-20% of automatically labeled data.

To handle hallucinated emotion annotations, implement these safeguards:
- Validate tags against your known vocabulary (reject `<burp>` if not in vocabulary)
- Require minimum segment duration (0.2 seconds) for audio events
- Use ensemble detection (multiple models must agree)
- Track annotation rates per batch (sudden spikes indicate detector issues)
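The first two safeguards reduce to a single validation pass per annotated transcript. A sketch using the 15-tag vocabulary above (function names and the return shape are our own):

```python
import re

EMOTION_TAGS = {"happy", "sad", "angry", "fearful", "surprised",
                "neutral", "disgusted", "confused"}
EVENT_TAGS = {"laughter", "breath", "cough", "sigh", "cry", "sneeze", "yawn"}
KNOWN_TAGS = EMOTION_TAGS | EVENT_TAGS
TAG_RE = re.compile(r"\[([a-z_]+)\]")

def validate_tags(annotated_text, event_durations=None, min_event_s=0.2):
    """Return (unknown_tags, too_short_events) for one annotated
    transcript; both lists empty means the annotation passes."""
    tags = TAG_RE.findall(annotated_text)
    unknown = [t for t in tags if t not in KNOWN_TAGS]
    durations = event_durations or {}
    too_short = [t for t in tags
                 if t in EVENT_TAGS and durations.get(t, min_event_s) < min_event_s]
    return unknown, too_short
```

Per-batch rates of non-empty results feed the fourth safeguard (tracking annotation-rate spikes).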

## Training curriculum should introduce emotions at SFT, not pre-training

CosyVoice 3 and Inworld TTS-1 converge on a three-stage training curriculum where different capabilities are introduced at specific phases:

**Stage 1: Pre-training (Base Model)**
- Data: Large-scale diverse audio (100K-1M hours)
- Include: Script handling, basic code-mixing, multi-speaker data
- Exclude: Emotion annotations, fine-grained control tags
- Objective: Learn general speech patterns, prosody, phoneme-to-acoustics

**Stage 2: Supervised Fine-Tuning (SFT)**  
- Data: 1,500-5,000 hours of high-quality, annotated data (per CosyVoice 3)
- Introduce: Emotion labels, style descriptions, instruction-following capability
- Format: Natural language descriptions ("Speak in a happy tone") or inline tags
- Objective: Learn controllable generation, zero-shot adaptation

**Stage 3: Post-training/Alignment**
- Methods: DPO, GRPO, or CosyVoice 3's DiffRO-EMO
- Rewards: WER (pronunciation), speaker similarity, emotion classification accuracy
- Critical insight: Improving emotion expression can adversely affect pronunciation—balance needed

For Indic languages specifically, AI4Bharat's Indic-Parler-TTS supports 12 emotion/style categories that map well to podcast content: Command, Anger, Narration, Conversation, Disgust, Fear, Happy, Neutral, Proper Noun, News, Sad, Surprise.

## Multi-representation JSONL format enables flexible downstream training

For 500K segments, JSONL format enables stream processing, parallelization, and easy appending. Store **four transcript representations** per segment:

```json
{
  "id": "podcast_001_segment_0042",
  "audio_path": "segments/podcast_001/0042.flac",
  "duration_ms": 5420,
  
  "transcripts": {
    "verbatim_native_script": "आज की तारीख २३ जनवरी २०२४ है",
    "verbatim_roman": "aaj kii taarikh 23 janvarii 2024 hai",
    "normalized_spoken": "आज की तारीख तेईस जनवरी दो हज़ार चौबीस है",
    "verbatim_with_emotions": "आज की तारीख [emphasis]तेईस[/emphasis] जनवरी दो हज़ार चौबीस है [breath]"
  },
  
  "quality_metrics": {
    "snr_db": 32.5,
    "nisqa_mos": 4.1,
    "has_music": false,
    "is_multi_speaker": false,
    "alignment_confidence": 0.94
  },
  
  "transcription_metadata": {
    "primary_model": "gemini-3-flash",
    "primary_confidence": 0.92,
    "validation_status": "accepted",
    "estimated_wer": 0.08
  }
}
```

Number normalization is critical: store both verbatim (`₹५००`) and spoken forms (`पाँच सौ रुपये`). Use Indic NLP Library for script normalization—and avoid Whisper's default normalizer, which **removes vowel signs (matra)** in Brahmic scripts, artificially inflating WER.
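A toy sketch of the verbatim-to-spoken expansion for the currency case; the lookup table is illustrative only, and a production normalizer would use a full Hindi number-to-words expander rather than a hand-written table.

```python
# Devanagari digits U+0966..U+096F map one-to-one onto ASCII 0-9.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

# Illustrative lookup only -- not a complete number-to-words system.
SPOKEN_HI = {"500": "पाँच सौ", "23": "तेईस", "2024": "दो हज़ार चौबीस"}

def normalize_currency_hi(verbatim: str) -> str:
    """Expand a verbatim amount like '₹५००' to its spoken Hindi form."""
    ascii_form = verbatim.translate(DEVANAGARI_DIGITS)
    if ascii_form.startswith("₹"):
        amount = ascii_form[1:]
        return SPOKEN_HI.get(amount, amount) + " रुपये"
    return SPOKEN_HI.get(ascii_form, ascii_form)
```

Keeping both forms in the JSONL record lets downstream training choose which representation to tokenize.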

## Validation at 500K scale requires confidence cascading and strategic sampling

Without ground truth, validation relies on multi-model consensus, alignment success, and reference-free quality estimation. The recommended pipeline:

1. **Gemini 3 Flash transcription** with structured JSON output
2. **Confidence scoring** via token log-probabilities (available via API)
3. **Routing decisions:**
   - High confidence (≥0.9): Accept directly
   - Medium (0.75-0.9): Cross-validate with secondary model (IndicWhisper or MMS)
   - Low (<0.75): Escalate to Gemini 3 Pro or flag for review
4. **IndicMFA forced alignment** as final quality gate—alignment failure indicates transcription error
5. **NoRefER** for reference-free WER estimation using attention patterns
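The routing decisions and the alignment gate reduce to a small pure function per segment. A sketch with the thresholds above (the label strings and signature are our own):

```python
def route_segment(confidence: float, alignment_ok: bool) -> str:
    """Map one transcribed segment to a pipeline action."""
    if not alignment_ok:
        return "reject"           # forced alignment is the final quality gate
    if confidence >= 0.9:
        return "accept"
    if confidence >= 0.75:
        return "cross_validate"   # secondary model (IndicWhisper or MMS)
    return "escalate"             # Gemini 3 Pro or human review
```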

For human validation, use stratified sampling: review 0.1% of high-confidence segments, 2% of medium, and 10% of low-confidence segments. This achieves statistical coverage while focusing effort on risky data.
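The stratified draw itself is a few lines; a sketch with the review rates above (stratum labels and function names are our own), seeded for reproducible review sets:

```python
import random

REVIEW_RATES = {"high": 0.001, "medium": 0.02, "low": 0.10}

def sample_for_review(segments, rates=REVIEW_RATES, seed=0):
    """Draw a stratified human-review sample. `segments` maps a
    confidence stratum to a list of segment ids."""
    rng = random.Random(seed)
    picked = {}
    for stratum, ids in segments.items():
        # Review at least one segment from any non-empty stratum.
        k = max(1, round(len(ids) * rates[stratum])) if ids else 0
        picked[stratum] = rng.sample(ids, k)
    return picked
```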

Audio quality filtering should apply these thresholds:
- **SNR ≥ 32 dB** for TTS-quality data (20-32 dB acceptable with enhancement)
- **Segment duration 4-10 seconds** optimal for attention-based TTS models
- **Single speaker** for voice cloning applications (multi-speaker for conversational TTS)
- **No background music** or significant reverberation

Tools for automated quality estimation include HuggingFace's DataSpeech (SNR, reverberation, speech rate) and NISQA/SQUIM for MOS prediction without reference audio.

## Conclusion: Production pipeline orchestration

The complete pipeline for your 500K YouTube podcast segments should flow through six stages: ingestion (extract FLAC, segment by VAD), audio quality analysis (DataSpeech metrics), multi-model transcription (Gemini Flash primary, cascading to Pro), validation and consensus (alignment + confidence scoring), text normalization (Indic NLP Library), and output formatting (JSONL with multi-representation transcripts).

Parallelize using Flyte map tasks or Ray, with per-batch checkpointing every 100-500 batches. Store processed segments in **1-5 GB TAR archives** with JSONL manifests. Expected throughput: ~4,200 hours of audio processable in 2-3 days on a properly configured GPU cluster.

The novel insights from this research that diverge from common practice: IndicMFA's G2G approach eliminates the pronunciation dictionary bottleneck that has historically limited Indic forced alignment; single-script transliteration enables code-mixed TTS without labeled code-mixed recordings; and emotion capabilities should be introduced during SFT rather than pre-training, despite the intuition that more training data exposure would help. These patterns, validated across CosyVoice, FireRedTTS, and Inworld implementations, provide a production-ready blueprint for building human-quality Indic TTS at scale.