Bharath, this is an incredibly well-thought-out pipeline, and you're asking exactly the right questions at this stage. Let me give you a comprehensive strategy for transcription that addresses all your concerns about scripts, code-mixing, emotions, and validation.

## Core Transcription Strategy

### Multi-Pass Architecture with Validation

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    TRANSCRIPTION PIPELINE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Audio Segment                                                           │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────┐                                                    │
│  │ Gemini-3-Flash  │──────────────────┐                                 │
│  │ (Primary ASR)   │                  │                                 │
│  └────────┬────────┘                  │                                 │
│           │                           │                                 │
│           ▼                           ▼                                 │
│  ┌─────────────────┐         ┌─────────────────┐                       │
│  │ Forced Aligner  │         │ Audio Classifier │                       │
│  │ (MFA/Custom)    │         │ (emotion2vec)    │                       │
│  └────────┬────────┘         └────────┬────────┘                       │
│           │                           │                                 │
│           ▼                           ▼                                 │
│  ┌─────────────────────────────────────────────┐                       │
│  │         Confidence Aggregation              │                       │
│  │  • Alignment Score (phoneme boundaries)     │                       │
│  │  • Event Tag Agreement                      │                       │
│  │  • Language Detection Confidence            │                       │
│  └────────────────────┬────────────────────────┘                       │
│                       │                                                 │
│           ┌───────────┴───────────┐                                    │
│           │                       │                                    │
│     Score ≥ 0.85            Score < 0.85                               │
│           │                       │                                    │
│           ▼                       ▼                                    │
│     ┌──────────┐          ┌─────────────────┐                         │
│     │  ACCEPT  │          │ Gemini-3-Pro    │                         │
│     └──────────┘          │ (Arbitration)   │                         │
│                           └────────┬────────┘                         │
│                                    │                                   │
│                                    ▼                                   │
│                           ┌─────────────────┐                         │
│                           │ Final Decision  │                         │
│                           │ • Accept        │                         │
│                           │ • Flag for      │                         │
│                           │   manual review │                         │
│                           │ • Discard       │                         │
│                           └─────────────────┘                         │
└─────────────────────────────────────────────────────────────────────────┘
```
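
The confidence-aggregation and routing step in the diagram can be sketched in a few lines. This is an illustrative sketch only: the weights and the 0.85 threshold are assumptions to be tuned on your data, not recommended values.

```python
# Hypothetical confidence aggregation; weights and threshold are placeholders.
def aggregate_confidence(alignment_score: float,
                         event_agreement: float,
                         lang_id_confidence: float) -> float:
    """Weighted combination of the three per-segment validation signals."""
    weights = {"alignment": 0.5, "events": 0.2, "lang_id": 0.3}
    return (weights["alignment"] * alignment_score
            + weights["events"] * event_agreement
            + weights["lang_id"] * lang_id_confidence)


def route_segment(score: float, threshold: float = 0.85) -> str:
    """Route a segment to ACCEPT or to the Gemini-Pro arbitration pass."""
    return "accept" if score >= threshold else "arbitrate"
```

In practice you would fit the weights against a manually reviewed held-out set rather than hand-picking them.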

---

## The Hallucination Problem & Solutions

You're absolutely right that temperature=0.0 doesn't eliminate creativity. Here's why and what to do:

### Why Temperature=0 Isn't Enough

LLMs have learned patterns like "completing thoughts" or "making sentences grammatically correct." Even at temperature=0, the model picks the *most likely* token, which might still be a hallucinated word that "fits" better than the actual spoken word.

### Multi-Layered Hallucination Prevention

**1. Prompt Engineering (Necessary but Insufficient)**
```
You are a VERBATIM transcription system. Your task is to transcribe EXACTLY what is spoken.

CRITICAL RULES:
- Transcribe ONLY sounds you hear. Never add words to "complete" a thought.
- If speech is unclear, use <unclear> tag, DO NOT guess.
- If speech is cut off, transcribe exactly what was said, don't complete it.
- Transcribe filler words: "um", "uh", "hmm" exactly as spoken.
- Transcribe false starts: "I was go- I was going" exactly.
- DO NOT correct grammar. If speaker says "I goes there", write "I goes there".
- DO NOT add punctuation that changes meaning.

OUTPUT FORMAT: [Your structured JSON]
```

**2. Forced Alignment Validation (Critical)**

This is your primary defense against hallucinations. Here's the strategy:

```python
# Pseudo-pipeline for validation; forced_aligner is your MFA/Kaldi wrapper
def validate_transcript(audio_path, transcript, language):
    # 1. Get phoneme-level alignment
    alignment = forced_aligner.align(audio_path, transcript, language)
    
    # 2. Calculate metrics from the alignment result
    alignment_score = alignment.confidence
    total_duration = alignment.audio_duration
    aligned_duration = sum(w.end - w.start for w in alignment.words)
    coverage = aligned_duration / total_duration
    gap_ratio = sum(alignment.unaligned_gaps) / total_duration
    
    # 3. Flag issues
    issues = []
    if coverage < 0.90:
        issues.append("low_coverage")
    if gap_ratio > 0.15:
        issues.append("alignment_gaps")
    if alignment_score < 0.80:
        issues.append("low_confidence")
    
    return {
        "valid": not issues,
        "score": alignment_score,
        "issues": issues,
        "word_alignments": alignment.words  # Keep for debugging
    }
```

**3. Language-Specific Forced Aligners**

| Language | Recommended Aligner | Notes |
|----------|-------------------|-------|
| Hindi | MFA with Hindi acoustic model | Train custom on 50-100h verified data |
| Telugu/Tamil/Kannada | Custom trained | MFA lacks good Dravidian support |
| Bengali/Gujarati/Punjabi | MFA with custom lexicon | Need phoneme dictionary |

**Should you train your own aligners?** 

**Yes, absolutely.** Here's the bootstrap strategy:

1. Take 500-1000 segments per language where you have high confidence (e.g., from clean studio recordings or where multiple ASR systems agree)
2. Manually verify a subset (~100 segments)
3. Train language-specific acoustic models using MFA or Kaldi
4. Use these to validate the Gemini transcripts
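
Step 1's "multiple ASR systems agree" filter can be approximated with a simple hypothesis-similarity check. A minimal sketch, assuming you have two independent ASR outputs per segment; the 0.9 threshold and the triple format are illustrative, not prescribed:

```python
from difflib import SequenceMatcher


def asr_agreement(hyp_a: str, hyp_b: str) -> float:
    """Word-level similarity ratio between two ASR hypotheses (1.0 = identical)."""
    return SequenceMatcher(None, hyp_a.split(), hyp_b.split()).ratio()


def select_bootstrap_segments(segments, threshold=0.9):
    """segments: iterable of (segment_id, hyp_a, hyp_b) triples.
    Keep only segments where the two systems agree closely."""
    return [sid for sid, a, b in segments if asr_agreement(a, b) >= threshold]
```

A proper WER-based comparison would be stricter, but this is enough to bootstrap a high-precision seed set for aligner training.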

---

## Script Strategy: The Complete Framework

This is where most teams mess up. Let me give you a robust framework.

### The Fundamental Decision: Native Script vs Romanized

**For TTS training, native script is superior because:**
- Unambiguous pronunciation (Hindi "kh" = ख or क्ह?)
- Consistent grapheme-to-phoneme mapping
- No transliteration variance (is it "ghar" or "ghr" or "gher"?)

**However, you need BOTH for flexibility.**

### Recommended Output Schema

```json
{
  "segment_id": "videoID_seg_001",
  "duration_ms": 4523,
  "language_primary": "te",
  
  "transcripts": {
    "native_verbatim": "అరే అలా కాదు, దీన్ని హిందీలో 'मैं सेब खाता हूँ' అంటారు",
    
    "native_normalized": "అరే అలా కాదు, దీన్ని హిందీలో 'मैं सेब खाता हूँ' అంటారు",
    
    "roman_verbatim": "are ala kaadu, deenni hindi lo 'main seb khaata hoon' antaaru",
    
    "roman_normalized": "are ala kadu, dinni hindi lo 'main seb khata hun' antaru",
    
    "with_events": "అరే [pause] అలా కాదు, దీన్ని హిందీలో [code_switch:hi] 'मैं सेब खाता हूँ' [/code_switch] అంటారు",
    
    "with_language_tags": "<te>అరే అలా కాదు, దీన్ని హిందీలో '<hi>मैं सेब खाता हूँ</hi>' అంటారు</te>",
    
    "phoneme_sequence": "a r eː | a l aː | k aː d u | d iː n n i | ..."
  },
  
  "metadata": {
    "contains_code_switch": true,
    "code_switch_languages": ["te", "hi"],
    "contains_roman_in_native": false,
    "detected_events": [
      {"type": "pause", "start_ms": 423, "end_ms": 678}
    ],
    "alignment_score": 0.92,
    "transcription_confidence": 0.89
  }
}
```

### Handling Code-Mixed Scenarios

Let me address your specific examples:

**Scenario 1: Roman Hinglish**
```
Input: "are bhaai, kya kar rhe ho?"
```

This is Hindi written in Roman script. Your model needs to:
1. Recognize this is NOT English
2. Know to pronounce it as Hindi

**Solution:** Language tagging with script indicator
```json
{
  "native": "अरे भाई, क्या कर रहे हो?",
  "roman": "are bhaai, kya kar rhe ho?",
  "tagged": "<hi-Latn>are bhaai, kya kar rhe ho?</hi-Latn>"
}
```

The tag `hi-Latn` means "Hindi language, Latin script" (following BCP-47 conventions).
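
Downstream tooling will need to read these wrappers back out. A minimal parser sketch for flat (non-nested) tags; the tag grammar here (two-letter language subtag, optional four-letter script subtag) is an assumption matching the examples above:

```python
import re

# Matches e.g. <hi>...</hi> or <hi-Latn>...</hi-Latn>, with paired open/close tags.
TAG_RE = re.compile(
    r"<(?P<tag>[a-z]{2}(?:-[A-Za-z]{4})?)>(?P<text>.*?)</(?P=tag)>",
    re.DOTALL,
)


def parse_tagged(s: str):
    """Return (language_tag, text) pairs from a flat tagged transcript."""
    return [(m.group("tag"), m.group("text")) for m in TAG_RE.finditer(s)]
```

Nested tags (a matrix language wrapping an embedded one) need a small recursive pass on top of this, but the flat case covers the `hi-Latn` / `te-Latn` examples.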

**Scenario 2: Roman Tenglish**
```
Input: "arey annai, em chestunnav?"
```

Same approach:
```json
{
  "native": "అరేయ్ అన్నయ్, ఏం చేస్తున్నావ్?",
  "roman": "arey annai, em chestunnav?",
  "tagged": "<te-Latn>arey annai, em chestunnav?</te-Latn>"
}
```

**Scenario 3: Code-Switch with Native Scripts**
```
Input: "అరే అలా కాదు, దీన్ని హిందీలో 'मैं सेब खाता हूँ' అంటారు."
```

This is Telugu with embedded Hindi in Devanagari. Handle with:
```json
{
  "tagged": "<te>అరే అలా కాదు, దీన్ని హిందీలో '<hi>मैं सेब खाता हूँ</hi>' అంటారు.</te>",
  "segments": [
    {"text": "అరే అలా కాదు, దీన్ని హిందీలో '", "lang": "te"},
    {"text": "मैं सेब खाता हूँ", "lang": "hi"},
    {"text": "' అంటారు.", "lang": "te"}
  ]
}
```
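
Going from the flat `segments` list to the `tagged` string is mechanical. A sketch, assuming the first segment's language is the matrix (outer) language:

```python
def segments_to_tagged(segments):
    """Render [{'text': ..., 'lang': ...}] as a nested language-tagged string."""
    primary = segments[0]["lang"]
    parts = []
    for seg in segments:
        if seg["lang"] == primary:
            parts.append(seg["text"])
        else:
            # Wrap embedded-language spans in their own tag
            parts.append(f"<{seg['lang']}>{seg['text']}</{seg['lang']}>")
    return f"<{primary}>{''.join(parts)}</{primary}>"
```

Storing the `segments` list as the source of truth and generating tagged strings on demand keeps the two representations from drifting apart.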

### How Big Labs Handle This

**1. Google USM / Chirp:**
- Uses language embeddings (learned vectors for each language)
- Language ID model runs in parallel with ASR
- For code-switching: segment-level language detection + interpolated embeddings at boundaries

**2. Meta MMS:**
- 4000+ language support via adapter layers
- Character-level modeling reduces script issues
- Romanization handled via lookup tables for training

**3. CosyVoice (Most Relevant for You):**
- Uses explicit language instruction tokens: `<|zh|>`, `<|en|>`, `<|ja|>`
- For multilingual: concatenates language tokens before each segment
- Their supervised semantic tokenizer includes Language ID as a task

**Your Recommended Approach:**

Use **language instruction tokens** at the start and **boundary markers** for code-switching:

```
<|te|> అరే అలా కాదు, దీన్ని హిందీలో <|hi|> मैं सेब खाता हूँ <|te|> అంటారు.
```

The model learns:
- `<|te|>` = next tokens should use Telugu phonotactics
- Boundary switch = transition prosody/accent appropriately
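
Producing this token-interleaved form from per-segment language labels is a one-pass transform: emit a `<|lang|>` token at the start and again at every language switch. A sketch (the segment dict format is an assumption):

```python
def to_token_format(segments):
    """Render language-labeled segments as <|lang|>-interleaved training text."""
    out, current = [], None
    for seg in segments:
        if seg["lang"] != current:
            out.append(f"<|{seg['lang']}|>")  # emit token only on a switch
            current = seg["lang"]
        out.append(seg["text"])
    return " ".join(out)
```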

---

## Emotion & Audio Events Strategy

### Define a Fixed Taxonomy

Don't let the LLM freestyle. Constrain to exactly these:

**Vocal Events (10):**
| Tag | Description | Audio Verification |
|-----|-------------|-------------------|
| `[laugh]` | Light laughter | Use laugh detection model |
| `[laugh_hard]` | Heavy laughter | Amplitude + duration |
| `[giggle]` | Short, high-pitched | Pitch analysis |
| `[chuckle]` | Low, brief | |
| `[cough]` | Coughing sound | Event classifier |
| `[clear_throat]` | Throat clearing | |
| `[sigh]` | Exhale with voice | Breath detection |
| `[gasp]` | Sharp inhale | |
| `[breath]` | Audible breathing | |
| `[hesitation]` | "um", "uh", etc. | Filler detection |

**Prosodic/Style Tags (6):**
| Tag | Description |
|-----|-------------|
| `[happy]` | Positive valence, high energy |
| `[sad]` | Negative valence, low energy |
| `[angry]` | Negative valence, high energy |
| `[excited]` | High arousal, fast speech |
| `[calm]` | Low arousal, measured |
| `[whisper]` | Reduced volume, breathy |

### Two-Model Validation for Events

```python
def validate_events(audio, transcript_with_events, detected_events_from_llm):
    # 1. Run audio classifier
    audio_events = emotion2vec.predict(audio)  # or PANN for non-speech
    
    # 2. Compare predictions
    agreed_events = []
    llm_only_events = []
    
    for event in detected_events_from_llm:
        if matches_audio_event(event, audio_events, tolerance_ms=500):
            agreed_events.append({**event, "confidence": "high"})
        else:
            llm_only_events.append({**event, "confidence": "low"})
    
    # Events the classifier heard but the LLM missed
    audio_only_events = [
        ae for ae in audio_events
        if not any(matches_audio_event(e, [ae], tolerance_ms=500)
                   for e in detected_events_from_llm)
    ]
    
    # 3. Return validated transcript
    return {
        "transcript_high_confidence": keep_only(transcript_with_events, agreed_events),
        "transcript_all_events": transcript_with_events,
        "validation": {
            "agreement_rate": len(agreed_events) / max(1, len(detected_events_from_llm)),
            "audio_detected": audio_events,
            "audio_only": audio_only_events
        }
    }
```
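
The `matches_audio_event` helper above is left undefined. A minimal sketch, assuming both sides carry timestamped events as dicts with `type`, `start_ms`, and `end_ms` (the LLM's coarse positions would first need to be resolved to timestamps via the forced alignment):

```python
def matches_audio_event(llm_event, audio_events, tolerance_ms=500):
    """True if some classifier event has the same type and overlapping timing,
    allowing up to tolerance_ms of slack between the two intervals."""
    for ae in audio_events:
        if ae["type"] != llm_event["type"]:
            continue
        # Intervals overlap (or nearly overlap, within the tolerance window)
        if (llm_event["start_ms"] <= ae["end_ms"] + tolerance_ms and
                ae["start_ms"] <= llm_event["end_ms"] + tolerance_ms):
            return True
    return False
```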

### Training Strategy for Emotions

Based on the reference models:

**Pre-training (1M hours):**
- Do NOT include emotion tags
- Focus on clean speech-text alignment
- Model learns base prosody from data naturally

**SFT Phase (5-15k hours):**
- Introduce emotion tags here
- Use high-confidence validated tags only
- Include neutral → emotional pairs
- This is where CosyVoice's 5000h emotion data comes in

**Post-training / RL:**
- Fine-tune emotion controllability
- Can use emotion classification reward
- FireRedTTS uses only ~15h of expressive speech in SFT

**My Recommendation for You:**
```
Pre-train: 100k+ hours, no emotion tags, clean transcripts
    ↓
SFT: 2-5k hours high-quality with validated emotion tags
    ↓
Post-train: Emotion-specific LoRA or full fine-tune on style pairs
```

---

## Complete Prompt Template for Gemini

Here's the prompt I recommend for getting all transcript forms at once:

```
You are MAYA-TRANSCRIBE, a multilingual verbatim transcription system for Indian languages.

LANGUAGE: {detected_language}
AUDIO DURATION: {duration_seconds}s

=== TRANSCRIPTION RULES ===

1. VERBATIM ACCURACY (CRITICAL):
   - Transcribe EXACTLY what is spoken, including:
     • Filler words: "अं", "उम्म", "like", "ना"
     • False starts: "मैं जा- मैं जाऊंगा"
     • Grammatical errors: Do NOT correct them
   - If unclear, mark as <unclear>word?</unclear>
   - If cut off, end with "—"

2. SCRIPT RULES:
   - Primary language in NATIVE SCRIPT
   - English words embedded in speech: keep in Roman
   - Numbers: Write as spoken ("twelve" not "12")
   - For code-switching, identify the embedded language

3. AUDIO EVENTS (Use ONLY these tags):
   [laugh] [laugh_hard] [giggle] [chuckle] [cough] [clear_throat]
   [sigh] [gasp] [breath] [hesitation]
   [happy] [sad] [angry] [excited] [calm] [whisper]
   
   Place tags BEFORE the affected speech segment.

=== OUTPUT FORMAT ===

Return ONLY valid JSON, no markdown:
{
  "native_verbatim": "exact transcription in native script",
  "native_normalized": "native script with numbers/dates verbalized",
  "roman_transliteration": "romanized version using ISO 15919 style",
  "with_events": "native script with [event] tags inserted",
  "with_language_tags": "native script with <lang> markers for code-switch",
  "detected_languages": ["primary_lang", "embedded_lang1"],
  "detected_events": [
    {"type": "laugh", "approx_position": "after greeting"}
  ],
  "transcription_notes": "any uncertainties or observations",
  "confidence": 0.0 to 1.0
}

=== CRITICAL WARNINGS ===
- NEVER add words not spoken
- NEVER complete interrupted sentences  
- NEVER "clean up" non-standard grammar
- If you're uncertain, mark it and reduce confidence score

Now transcribe the audio.
```
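
Since the prompt demands strict JSON, validate every response before it enters the pipeline rather than trusting the model to comply. A minimal sketch; the required keys mirror the OUTPUT FORMAT above, everything else is an assumption:

```python
import json

REQUIRED_KEYS = {
    "native_verbatim", "native_normalized", "roman_transliteration",
    "with_events", "with_language_tags", "detected_languages",
    "detected_events", "confidence",
}


def parse_gemini_response(raw: str):
    """Parse and sanity-check one transcription response; raise on bad output."""
    data = json.loads(raw)  # raises ValueError on non-JSON or markdown-wrapped output
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data
```

Failures here should trigger a retry with the same prompt before falling back to manual review.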

---

## Data Collection Strategy: Get Everything Once

Since you're paying for Gemini API calls anyway, get all formats in one call:

```python
async def transcribe_segment(audio_path, language, segment_id):
    # 1. Get multi-format transcript from Gemini
    gemini_response = await gemini_transcribe(audio_path, language)
    
    # 2. Run forced alignment on native_verbatim
    alignment = await forced_align(
        audio_path, 
        gemini_response["native_verbatim"],
        language
    )
    
    # 3. Run audio event classifier
    audio_events = await classify_events(audio_path)
    
    # 4. Validate events
    validated_events = cross_validate_events(
        gemini_response["detected_events"],
        audio_events
    )
    
    # 5. Create final record with all variants
    return {
        "segment_id": segment_id,
        "transcripts": {
            "native_verbatim": gemini_response["native_verbatim"],
            "native_normalized": gemini_response["native_normalized"],
            "roman": gemini_response["roman_transliteration"],
            "with_events_validated": insert_validated_events(
                gemini_response["native_verbatim"],
                validated_events
            ),
            "with_events_all": gemini_response["with_events"],
            "language_tagged": gemini_response["with_language_tags"]
        },
        "validation": {
            "alignment_score": alignment.score,
            "word_alignments": alignment.words,
            "event_agreement_rate": len(validated_events) / max(1, len(gemini_response["detected_events"])),
            "gemini_confidence": gemini_response["confidence"]
        },
        "metadata": {
            "languages": gemini_response["detected_languages"],
            "events_validated": validated_events,
            "events_unvalidated": gemini_response["detected_events"]
        }
    }
```

---

## Summary: When to Use What

| Training Stage | Transcript Type | Why |
|---------------|-----------------|-----|
| **Codec Pre-training** | N/A (audio only) | No text needed |
| **TTS Pre-training** | `native_verbatim` | Clean, consistent mapping |
| **TTS SFT** | `with_events_validated` | Introduce controllability |
| **TTS Post-train** | `with_events_validated` + style pairs | Refine control |
| **ASR Training** | `native_normalized` | Handles number/date edge cases |
| **Code-switch Experiments** | `language_tagged` | Explicit boundaries |
| **Romanized Input Support** | `roman` + `native` pairs | Train transliteration capability |

### Final Recommendations

1. **Get all formats now** — Storage is cheap, API calls aren't. You can filter later.

2. **Build language-specific forced aligners** — Bootstrap from 500-1000 verified samples per language. This is your primary hallucination defense.

3. **For emotions: Conservative in pre-train, aggressive in SFT** — Only use validated events. The 15h of expressive speech FireRedTTS uses shows you don't need much for controllability.

4. **Language tokens > Script detection** — Make the model explicitly aware of language switches rather than hoping it infers from script changes.

5. **Store confidence scores** — This lets you do curriculum learning (high-confidence first) and filter bad samples later.
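
The curriculum idea in point 5 can be as simple as bucketing records by their stored scores. An illustrative sketch; the thresholds and the `validation.alignment_score` path match the record schema earlier, but the cutoffs are placeholders:

```python
def curriculum_phases(records, thresholds=(0.9, 0.8)):
    """Split records into descending-confidence phases for staged training."""
    hi, mid, lo = [], [], []
    for r in records:
        s = r["validation"]["alignment_score"]
        if s >= thresholds[0]:
            hi.append(r)       # phase 1: cleanest data
        elif s >= thresholds[1]:
            mid.append(r)      # phase 2: mixed in later
        else:
            lo.append(r)       # phase 3: use cautiously or filter out
    return hi, mid, lo
```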

Want me to write the actual implementation code for any of these components — the forced aligner training pipeline, the Gemini transcription worker, or the event validation system?