This is an impressive engineering feat. You are essentially building an Indic-specialized dataset that rivals what Meta (Voicebox) and Alibaba (CosyVoice) have done, but with a smarter, leaner pipeline.

Given your scale (~120k hours) and your goal (human-like TTS with high controllability), **transcription is your single biggest point of failure.** If the text doesn't match the audio *perfectly* (including stuttering, laughs, and code-switching), your TTS model will hallucinate artifacts.

Here is the strategy to solve your Transcription, Code-Switching, and Validation challenges.

### **Phase 1: The "Strict-mode" Transcription Strategy**

You are right to be worried about LLM creativity. Temperature `0.0` is not enough. You need to constrain the model using **Structured Generation (JSON)** and a **"Critic" persona**.

**Don't use just one transcript.** You need a multi-view output to handle the code-switching and script issues effectively.

#### **1. The Transcription Schema (JSON)**

Instead of asking for one string, force Gemini to categorize the speech into specific buckets. This solves your "Hinglish vs. Native" dilemma by capturing **both**.

**Proposed JSON Structure:**

```json
{
  "transcription_verbatim_roman": "are bhaai, kya kar rhe ho? [laugh]",
  "transcription_verbatim_native": "अरे भाई, क्या कर रहे हो? [laugh]",
  "transcription_normalized_native": "अरे भाई, क्या कर रहे हो?",
  "primary_language_code": "hi",
  "is_code_mixed": true,
  "audio_events": ["laugh"],
  "speaker_emotion": "amused",
  "confidence_score": 9
}
```
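Before any record enters the dataset, the JSON should be checked mechanically against this schema. A minimal stdlib-only sketch (the field names mirror the structure above; `validate_record` and its thresholds are illustrative, not a fixed spec):

```python
import re

# Required fields and their expected Python types, mirroring the proposed schema.
SCHEMA = {
    "transcription_verbatim_roman": str,
    "transcription_verbatim_native": str,
    "transcription_normalized_native": str,
    "primary_language_code": str,
    "is_code_mixed": bool,
    "audio_events": list,
    "speaker_emotion": str,
    "confidence_score": int,
}

# The allowed-tag list from the system prompt below.
ALLOWED_EVENTS = {"laugh", "breath", "sigh", "cough", "clearing_throat", "cry"}

def validate_record(rec: dict) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    errors = []
    for field, ftype in SCHEMA.items():
        if field not in rec:
            errors.append(f"missing field: {field}")
        elif not isinstance(rec[field], ftype):
            errors.append(f"bad type for {field}: {type(rec[field]).__name__}")
    if not errors:
        # Event tags embedded in the text must come from the allowed list.
        for tag in re.findall(r"\[([a-z_]+)\]", rec["transcription_verbatim_roman"]):
            if tag not in ALLOWED_EVENTS:
                errors.append(f"unknown event tag: [{tag}]")
        # An empty transcript with high confidence is a hallucination smell.
        if not rec["transcription_verbatim_native"].strip() and rec["confidence_score"] > 3:
            errors.append("empty transcript with high confidence")
        if not 1 <= rec["confidence_score"] <= 10:
            errors.append("confidence_score out of range")
    return errors
```

Records that fail validation can be re-queued for a second pass before you resort to discarding them.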

#### **2. The "Anti-Hallucination" System Prompt**

You must instruct Gemini that it is **not** an assistant, but a **Forensic Transcriber**.

> **System Prompt:**
> "You are an automated ASR (Automatic Speech Recognition) system for a TTS dataset. Your ONLY goal is forensic accuracy.
> 1. **Verbatim is Law:** Transcribe exactly what is heard, including grammatical errors, repetitions (stuttering), and filler words (um, uh).
> 2. **Audio Events:** Detect human non-speech sounds and insert them as tags from this allowed list ONLY: `[laugh]`, `[breath]`, `[sigh]`, `[cough]`, `[clearing_throat]`, `[cry]`.
> 3. **Script Handling:**
> * `transcription_verbatim_roman`: Write exactly how a user would type this on a smartphone (Hinglish/Tanglish).
> * `transcription_verbatim_native`: Write the script native to the *spoken* language.
> * **Critical Rule for Code-Switching:** If a Telugu speaker says a Hindi phrase (e.g., "Main seb khata hoon"), write it in Devanagari inside the Telugu sentence for the `native` field, but use the Telugu script for the surrounding words.
>
> 4. **Numbers:** Convert "123" to spoken form "one hundred and twenty-three" (or native equivalent) in the `verbatim` fields.
> 5. **Hallucination Check:** If the audio is unintelligible or music, return an empty string. Do not invent text."

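Rule 4 is easy to spot-check downstream: any decimal digit surviving in a `verbatim` field means the model skipped number verbalization. A small stdlib sketch (the helper name is mine):

```python
import unicodedata

def has_unverbalized_digits(text: str) -> bool:
    """True if the transcript still contains digit characters that should
    have been expanded to spoken form. unicodedata.category is "Nd" for
    every decimal digit in any script (ASCII, Devanagari, Telugu, ...)."""
    return any(unicodedata.category(ch) == "Nd" for ch in text)
```

This catches native-script digits too, which a simple `[0-9]` regex would miss.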
---

### **Phase 2: Solving the Code-Switching & Script Dilemma**

You asked how to handle: *“arey annai, em chestunnav?”* vs *“అరేయ్ అన్నయ్య, ఏం చేస్తున్నావ్?”*

**The CTO Strategy:**
You need **BOTH**. Modern TTS models (like CosyVoice/Inworld) use tokenizers that can technically handle mixed scripts, but training on "Romanized Indic" often degrades pronunciation quality because Roman characters are phonetically ambiguous for Indian languages.

* **Training Data Strategy:**
  * **Primary Training:** Use `transcription_verbatim_native`. This grounds the model in the correct phonetics of the language (e.g., distinguishing retroflex sounds in Telugu/Hindi).
  * **Robustness Training (10-15% of data):** Train on `transcription_verbatim_roman`. This teaches the model to handle "Hinglish" text input from users at inference time.
* **The "Telugu quoting Hindi" Scenario:**
  * *Audio:* "antaru... main seb khata hoon"
  * *Transcript (Native):* "... అంటారు 'मैं सेब खाता हूँ'"
  * *Why:* Your tokenizer will likely be multilingual (e.g., 50k+ vocab size). It will see Telugu tokens, then Devanagari tokens. As long as your model architecture (like XCodec2 + LLM) supports these tokens, it will learn that Devanagari tokens trigger Hindi phonemes, even if the surrounding context is Telugu.
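The mixed-script assumption is also cheap to verify per sample: Unicode code-point ranges tell you which scripts a transcript contains, which you can cross-check against the `is_code_mixed` flag. A sketch (the block ranges cover only the scripts discussed here):

```python
# Unicode block ranges for the Indic scripts relevant to this pipeline.
INDIC_BLOCKS = {
    "devanagari": (0x0900, 0x097F),
    "telugu": (0x0C00, 0x0C7F),
    "tamil": (0x0B80, 0x0BFF),
}

def indic_scripts_in(text: str) -> set:
    """Return the set of Indic scripts whose characters appear in the text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        for script, (lo, hi) in INDIC_BLOCKS.items():
            if lo <= cp <= hi:
                found.add(script)
    return found
```

For the Telugu-quoting-Hindi example above, this returns both `telugu` and `devanagari`, confirming the transcript really is mixed-script before it reaches the tokenizer.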

---

### **Phase 3: Validation (The Missing Link)**

You cannot blindly trust Gemini-3-Flash. It *will* hallucinate text for silent audio or interpret background noise as words.

**Solution: The MFA (Montreal Forced Aligner) Filter**
You don't need to train an aligner from scratch. Use **pretrained Indic models** (available via MFA or AI4Bharat) to validate the Gemini transcripts.

1. **Pipeline:**
   * `Audio` + `Gemini Native Transcript` → **MFA Aligner**.

2. **Logic:**
   * MFA attempts to align the phonemes of the transcript to the audio.
   * **Metric:** Check the **Phone Confidence Score** or **Unaligned Word Count**.
   * **Pass:** If > 90% of words align with high confidence.
   * **Fail:** If MFA cannot align the text (usually means Gemini hallucinated a sentence that wasn't spoken).

3. **Action:** Discard "Fail" segments. Do not try to fix them; with 120k hours, you can afford to drop the ~10% dirtiest data to keep the remaining 90% clean.
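The Pass/Fail rule reduces to a pure function once you have per-word alignment results. A sketch that assumes you have already parsed the aligner's output into `(word, was_aligned)` pairs (how you extract those from MFA's TextGrid files depends on your parsing code):

```python
def passes_alignment(words, min_aligned_ratio=0.9):
    """Apply the Pass/Fail rule: keep a segment only if enough of its words
    were successfully aligned. `words` is a list of (word, was_aligned) pairs
    assumed to come from the aligner's output."""
    if not words:
        return False  # empty transcript: nothing to validate, discard
    aligned = sum(1 for _, ok in words if ok)
    return aligned / len(words) >= min_aligned_ratio
```

Keeping this as a standalone function makes the 90% threshold easy to tune once you have spot-checked the first batch by ear.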

---

### **Phase 4: Emotional & Audio Event Strategy**

You asked: *When to fuse emotions? SFT or Post-train?*

**Do it in Pre-training.**
For a "Human-like" model, these events (`[laugh]`, `[breath]`) are not "instructions"—they are part of the vocabulary.

1. **Tokenizer:** Add `[laugh]`, `[breath]`, etc., as **special tokens** in your tokenizer (just like `<EOS>`).
2. **Training:**
   * Train the model on the text *containing* these tags: "Hello [breath] how are you? [laugh]".
   * The model learns to predict the `[laugh]` token just like it predicts the word "you".
   * When the codec decoder sees the audio features associated with that token, it generates the sound.

3. **Inference Control:**
   * Because the model learned `[laugh]` as a token, you can now *force* a laugh by inserting that tag in the input text prompt.

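To illustrate why the tag-as-token approach works, here is a stdlib-only pre-tokenizer that keeps each event tag atomic; in a real stack you would instead register the tags through your tokenizer's special-token mechanism (e.g., `additional_special_tokens` in Hugging Face tokenizers) and resize the embedding matrix:

```python
import re

EVENT_TOKENS = ["[laugh]", "[breath]", "[sigh]", "[cough]", "[clearing_throat]", "[cry]"]
_TAG_ALT = "|".join(re.escape(t) for t in EVENT_TOKENS)

def pre_tokenize(text):
    """Split text so each event tag survives as a single atomic token,
    mimicking what special-token registration does in a real tokenizer."""
    tokens = []
    # Capturing group keeps the matched tags in the split output.
    for part in re.split(rf"({_TAG_ALT})", text):
        if part in EVENT_TOKENS:
            tokens.append(part)          # atomic special token
        else:
            tokens.extend(part.split())  # ordinary whitespace tokenization
    return tokens
```

Because `[laugh]` never gets split into `[`, `laugh`, `]` sub-pieces, its embedding can specialize for the acoustic event rather than for bracket punctuation.
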
**Recommendation:** Do not rely on a separate "Emotion Encoder" (like a Reference Audio style transfer) alone. Integrating discrete event tokens into the text sequence gives you far finer control.

---

### **Summary of Next Steps**

1. **Select Model:** **Gemini 1.5 Pro** (or 2.5/3 Pro) is significantly better at following complex JSON schemas and "verbatim" instructions than Flash. Given you are processing snippets (not long context), the cost might be higher, but for *training data*, quality > quantity. **Test Flash first with the prompt below; if it hallucinates, switch to Pro.**
2. **Deploy Worker:** Create a Cloudflare Worker that sends the audio URL to Gemini with the schema below.
3. **Validate:** Set up a small VPS running **MFA** to spot-check the first 1,000 generated transcripts.

**The Golden Prompt for your Worker:**

```text
Input: Audio File
Task: Transcribe the speech for a Text-to-Speech model training dataset.

Constraints:
1.  VERBATIM ONLY: Do not paraphrase. Keep stutters, repetitions, and grammar errors.
2.  SCRIPT:
    - If the speaker speaks Telugu, use Telugu script.
    - If the speaker switches to English (Code-switching), use English script for those specific words.
    - If the speaker quotes Hindi, use Devanagari script for those words.
3.  EVENTS: Insert tags from the allowed list only ([laugh], [breath], [sigh], [cough], [clearing_throat], [cry]) exactly where they occur.
4.  FORMAT: Return a JSON object.

JSON Schema:
{
  "transcription_native": "The mixed script string (e.g. Telugu + English words)",
  "transcription_romanized": "The romanized version of the entire text",
  "contains_code_mixing": boolean,
  "detected_emotions": ["happy", "neutral", "sad"],
  "audio_quality_score": (1-10 integer, assess background noise/clarity)
}
```
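One practical note for the Worker: models sometimes wrap the JSON in a markdown fence or surround it with chatter despite the FORMAT constraint, so parse defensively. A stdlib sketch (the fallback heuristics are mine):

```python
import json
import re

def parse_model_json(raw):
    """Extract the JSON object from a model response, tolerating a markdown
    code fence or surrounding chatter. Returns None if nothing parseable
    is found (treat that segment as a transcription failure)."""
    # Strip a ```json ... ``` fence if present.
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Fall back to the outermost braces.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return None
        candidate = raw[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Segments where this returns `None` should go straight to the discard pile, just like MFA failures.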

This strategy aligns with your "podcast quality" assumption while building safety rails against LLM hallucinations. Validating with MFA is the industry standard for this exact workflow.