## 1) What you should store per segment (so you don’t paint yourself into a corner)

Treat “transcript” as a **bundle** of fields, not one string. Minimum bundle (works for ASR + TTS + future controllability):

1. **orthographic_mixed** (the ground-truth one)

* Write each span in the script people actually use for that language, **but keep real English words in Latin**.
* This is usually the best training target for **in-the-wild** code-mix.

2. **orthographic_native** (canonical script per language)

* Same content, but for each non-English span you output the language’s native script.
* Useful for language-specific text pipelines and evaluation.

3. **romanized** (lossy-but-convenient)

* One consistent romanization scheme (pick one; don’t mix ad-hoc spellings).
* Useful for debugging, search, and some model interfaces.

4. **normalized_for_tts** (spoken form, not written form)

* Numbers expanded, abbreviations expanded, symbols spoken (when actually spoken).
* This mirrors how Common Voice-style corpora avoid the ambiguity of digits and abbreviations for speech modeling. ([commonvoice.mozilla.org][1])

Optional but high value:

5. **language_spans**: `[ {lang, script, text, roman, normalized} ... ]`

* This is how you solve “Telugu sentence containing a Hindi quote in Devanagari” without hacks.

6. **events**: a *separate* list like `[ {event, start_ms, end_ms, confidence} ]`

* Keep events **out of the main text** at first; later you can inject markup for controllability.
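The bundle above can be sketched as a plain dict (field names from this section; all values are made-up examples):

```python
# Illustrative per-segment transcript bundle (values are invented examples).
segment_bundle = {
    "orthographic_mixed": "are bhaai, kya kar rhe ho?",
    "orthographic_native": "अरे भाई, क्या कर रहे हो?",
    "romanized": "are bhaai, kya kar rhe ho?",
    "normalized_for_tts": "are bhaai kya kar rhe ho",
    "language_spans": [
        {"lang": "hi", "script": "Latn", "text": "are bhaai, kya kar rhe ho?",
         "roman": "are bhaai, kya kar rhe ho?",
         "normalized": "are bhaai kya kar rhe ho"},
    ],
    # Events live in their own list, never inside the text fields.
    "events": [
        {"event": "laugh", "start_ms": 1200, "end_ms": 1800, "confidence": 0.91},
    ],
}
```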

---

## 2) The core strategy: generate with Gemini, validate with aligners + cross-ASR agreement

### A. Don’t trust self-reported confidence from the LLM

Even at temperature 0, outputs can vary and the model can still “helpfully” normalize or paraphrase. Vertex AI explicitly notes temperature 0 is *mostly* deterministic, not fully deterministic. ([Google Cloud Documentation][2])
So the winning pattern is: **LLM transcription + independent validators**.

### B. Validators that scale to your data volume

**Validator 1 — Forced alignment score (primary gate)**
You want an alignment method that works across Indic languages and code-mix:

* **CTC forced alignment (recommended at your scale)** using `torchaudio.functional.forced_align()` / CTC segmentation. This is designed for aligning a known transcript to audio and supports multilingual workflows. ([docs.pytorch.org][3])
* Pair it with an ASR acoustic model that covers your languages (next point).

**Validator 2 — Use MMS for both LID + alignment backbone**
Meta’s **MMS** project provides (a) multilingual ASR for 1,100+ languages and (b) language ID models for 4,000+ languages. ([arXiv][4])
Practical use:

* Run **MMS-LID** on each segment and compare to your expected language bucket.
* Use **MMS-ASR** (language-conditioned where possible) as the acoustic model for CTC alignment and/or as a second opinion transcript.

**Validator 3 — Cross-ASR agreement (cheap, strong signal)**

* Transcript with Gemini → `T_g`
* Transcript with MMS or Whisper → `T_a`
* Normalize lightly (punctuation stripping, Unicode NFKC, whitespace) and compute CER/WER agreement.
  If `distance(T_g, T_a)` is small *and* alignment score is high → accept. If not → escalate.
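The light normalization plus edit-distance step can be sketched with the standard library alone (a minimal sketch; a production pipeline would likely use a faster Levenshtein implementation):

```python
import unicodedata

def light_normalize(text: str) -> str:
    """NFKC, strip punctuation, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.lower().split())

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

With `t_g = light_normalize(T_g)` and `t_a = light_normalize(T_a)`, gate on `cer(t_g, t_a)` being below a threshold you calibrate on a gold set.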

**Escalation policy (cost control):**

1. Gemini Flash → if pass gates, accept
2. If fail: re-run once (same settings)
3. If still fail: Gemini Pro (or 2.5 Pro) only for those segments
4. If still fail: drop segment (you have abundance)

This “selective spend” is exactly how big data pipelines stay sane.

---

## 3) Forced aligners: do you need per-language forced aligners?

### Option 1 (best ROI): CTC alignment + multilingual acoustic models

* No lexicon/G2P headaches.
* Works for code-mix if your tokenization supports it.
* You get a scalar “how well does this text explain this audio” score.

TorchAudio provides the forced alignment APIs and tutorials explicitly for this use case. ([docs.pytorch.org][3])
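To make the scalar score concrete, here is a minimal pure-Python Viterbi over the blank-interleaved target, the quantity that `torchaudio.functional.forced_align()` computes far more efficiently on real emissions; this is a teaching sketch, not a replacement for the TorchAudio API:

```python
import math

def ctc_align_score(log_probs, target, blank=0):
    """Best-path (Viterbi) CTC forced alignment score.
    log_probs: [T][C] per-frame log-probabilities; target: label ids.
    Returns the average per-frame log-prob of the best alignment path."""
    ext = [blank]                       # blank-interleaved target: ∅ t1 ∅ t2 ∅
    for t in target:
        ext += [t, blank]
    T, S = len(log_probs), len(ext)
    NEG = -math.inf
    dp = [[NEG] * S for _ in range(T)]
    dp[0][0] = log_probs[0][ext[0]]
    if S > 1:
        dp[0][1] = log_probs[0][ext[1]]
    for i in range(1, T):
        for s in range(S):
            best = dp[i - 1][s]                        # stay in state
            if s >= 1:
                best = max(best, dp[i - 1][s - 1])     # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                best = max(best, dp[i - 1][s - 2])     # skip over a blank
            dp[i][s] = best + log_probs[i][ext[s]]
    score = max(dp[-1][-1], dp[-1][-2] if S > 1 else NEG)
    return score / T                                   # length-normalized
```

A transcript that genuinely explains the audio yields a higher (less negative) score than a wrong one, which is exactly the gate you threshold on.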

### Option 2: Montreal Forced Aligner (MFA)

MFA is excellent when you want phone-level boundaries and language-specific lexicons; it’s built on Kaldi and supports training your own acoustic/G2P/language models from your data. ([Montreal Forced Aligner][5])
Downside: you’ll likely end up maintaining **multiple** language packs + dictionaries (and code-mix makes lexicon rules messy).

**Recommendation:** start with CTC alignment everywhere; only use MFA for a curated “gold” subset where you want very high-quality phone timing.

---

## 4) How to stop Gemini from being “creative”: structured outputs + hard constraints

### A. Use JSON Schema / structured output (non-negotiable)

Gemini supports **structured outputs that adhere to a provided JSON Schema**. ([Google AI for Developers][6])
This eliminates:

* extra prose
* format drift
* “helpful” commentary

### B. Restrict event tags with enums

Your hallucination risk explodes if you let it invent tags. Use a fixed enum list and reject anything outside.

### C. Prefer “verbatim” instructions + explicit uncertainty token

Tell it: if uncertain, output `<unk>` for that span rather than guessing. Then your aligner will punish `<unk>` less than wrong words, and you can route those segments to Pro.

Example schema (illustrative):

```json
{
  "type": "object",
  "required": ["language_spans", "orthographic_mixed", "normalized_for_tts"],
  "properties": {
    "orthographic_mixed": { "type": "string" },
    "normalized_for_tts": { "type": "string" },
    "language_spans": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["lang", "script", "text"],
        "properties": {
          "lang": { "type": "string", "description": "BCP-47 like hi, te, ta, kn, ml, en" },
          "script": { "type": "string" },
          "text": { "type": "string" },
          "roman": { "type": "string" },
          "normalized": { "type": "string" },
          "uncertain_tokens": { "type": "array", "items": { "type": "string" } }
        }
      }
    },
    "events": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["event", "start_ms", "end_ms", "confidence"],
        "properties": {
          "event": {
            "type": "string",
            "enum": ["laugh", "giggle", "cough", "sneeze", "breath", "sigh", "cry", "yawn", "noise", "music"]
          },
          "start_ms": { "type": "integer" },
          "end_ms": { "type": "integer" },
          "confidence": { "type": "number" }
        }
      }
    }
  }
}
```
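A minimal post-hoc check mirroring the schema's hard constraints (required fields, event enum, span shape) is cheap insurance even when the API enforces the schema; the field names follow the schema above:

```python
ALLOWED_EVENTS = {"laugh", "giggle", "cough", "sneeze", "breath",
                  "sigh", "cry", "yawn", "noise", "music"}
REQUIRED = {"language_spans", "orthographic_mixed", "normalized_for_tts"}

def validate_bundle(bundle: dict) -> list:
    """Return a list of violations; an empty list means the bundle passes."""
    errors = [f"missing field: {k}" for k in REQUIRED - bundle.keys()]
    for ev in bundle.get("events", []):
        if ev.get("event") not in ALLOWED_EVENTS:
            errors.append(f"unknown event tag: {ev.get('event')!r}")
    for span in bundle.get("language_spans", []):
        if not {"lang", "script", "text"} <= span.keys():
            errors.append(f"incomplete span: {span}")
    return errors
```

Reject (or escalate) any response with a non-empty error list instead of trying to repair it.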

**Generation params:**

* Lower temperature reduces randomness, but doesn’t guarantee determinism. ([Google Cloud Documentation][2])
* For Gemini 3 specifically, Google cautions that moving temperature below default may cause unexpected behavior in some tasks. (Test this empirically for transcription.) ([Google AI for Developers][7])

---

## 5) Scripts + code-mix: handle your examples cleanly with span tagging

### Example 1: “are bhaai, kya kar rhe ho ?”

What you want is *language identity* (Hindi) even if written roman.

Store:

* `orthographic_mixed`: `are bhaai, kya kar rhe ho?`
* `orthographic_native`: `अरे भाई, क्या कर रहे हो?`
* spans:

  * `{lang:"hi", script:"Latn", text:"are bhaai, kya kar rhe ho?"}`
  * and optionally also provide a Devanagari projection in `orthographic_native`

### Example 2: “arey annai, em chestunnav ?”

Same pattern, but Telugu. Store Telugu-native in `orthographic_native` and roman in `orthographic_mixed/romanized`.

### Example 3 (your tricky one):

“are ala kadu, denni hindi lo ‘मैं सेब खाता हूँ’ antaru”

This is exactly why you need spans:

* Span A: Telugu (likely Latin-script transliteration or native Telugu depending on what was spoken/written)
* Span B: Hindi quote in Devanagari
* Span C: Telugu again

Your model can later be trained to consume:
`<L:te> ... <L:hi> ... <L:te> ...`
so it won’t mispronounce the Hindi quote using Telugu phonology.
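Projecting `language_spans` into that tagged training view is a few lines (tag format as above; purely illustrative):

```python
def spans_to_tagged(spans):
    """Render language_spans as '<L:xx> text' runs, emitting a tag
    only at language switch points."""
    out, prev_lang = [], None
    for span in spans:
        if span["lang"] != prev_lang:
            out.append(f"<L:{span['lang']}>")
            prev_lang = span["lang"]
        out.append(span["text"])
    return " ".join(out)
```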

This is also consistent with code-mixed TTS literature: handling code-mix requires dealing with multiple languages/scripts in the same utterance. ([lrec-conf.org][8])

---

## 6) Emotions / non-speech events: how to add without wrecking your corpus

### A. Don’t start by asking the LLM to sprinkle tags everywhere

Classic transcription conventions keep non-speech tags **very limited** for a reason (precision > recall). Switchboard guidelines, for example, used a small bracketed set like `[laughter]` and similar tokens. ([isip.piconepress.com][9])

### B. Use detector-first, LLM-second

At your scale, the safest path is:

1. Run an **audio event detector** (you already use PANNs; good) ([arXiv][10])
2. Only if event confidence is high, allow the transcript bundle to include that event in the `events` list
3. Optionally ask Gemini to place *positions* (timestamps or “before word X”) **conditioned on the detector output**, not from scratch.

### C. When to “fuse” these capabilities into the model

Follow the playbook you already cited:

* **Pretraining:** keep it mostly clean text↔speech token modeling; no noisy markups.
* **SFT for controllability:** introduce a curated subset with markup tags (emotion + nonverbal) once you trust the tagger.
  This matches how modern systems treat expressive control as a later-stage capability, not something you contaminate the whole pretrain set with.

This is aligned with how **CosyVoice 3** discusses post-training strategies (DiffRO) and using supervised multi-task signals like SER/AED in tokenizer training. ([arXiv][11])
And Inworld TTS-1 explicitly supports fine-grained emotional control + non-verbal vocalizations via markups, trained through staged pretrain→SFT→alignment. ([arXiv][12])

---

## 7) A concrete end-to-end transcription pipeline you can implement

For each `segment.wav/flac`:

1. **Run LID** (MMS-LID) → `lid_probs` ([arXiv][4])
2. **Gemini transcription** with:

   * structured JSON schema ([Google AI for Developers][6])
   * spans + `<unk>` policy
3. **CTC forced alignment** (TorchAudio) using MMS-ASR backbone → `align_score`, `%aligned`, “star insertions” ([docs.pytorch.org][3])
4. **Cross-ASR agreement**:

   * run MMS-ASR (or Whisper) transcript
   * compute CER/WER vs Gemini transcript
5. **Decision**:

   * if `align_score high AND agreement high AND lid consistent` → accept
   * else → escalate to Gemini Pro and re-check
   * if it still fails → drop segment
6. **Events**:

   * PANNs-based event probs ([arXiv][10])
   * only keep whitelisted events over threshold; store separately from text

Store everything in your tar’s `metadata.json` or a shard-level `.jsonl` so training can choose which text view to consume.
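Steps 1–5 collapse into a small decision gate; the threshold values below are starter assumptions to calibrate against a gold set, not tuned recommendations:

```python
def decide(align_score, cer_agreement, lid_match,
           min_align=-1.5, max_cer=0.15):
    """Accept or escalate one segment.
    align_score: length-normalized log-prob from forced alignment;
    cer_agreement: CER between Gemini and second-opinion transcripts;
    lid_match: whether MMS-LID agrees with the expected language bucket."""
    if lid_match and align_score >= min_align and cer_agreement <= max_cer:
        return "accept"
    return "escalate"  # retry, then Gemini Pro, then drop
```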

---

## 8) What forms you should request from the model (recommended “bundle”)

Use exactly these fields to start (your instinct is right):

* `orthographic_mixed` (primary)
* `orthographic_native`
* `romanized`
* `normalized_for_tts`
* `language_spans`
* `events` (enum-restricted, optional, detector-validated)

That gives you maximum downstream experiment freedom without forcing premature decisions.

If you want one extra that pays off later: add

* `pronunciation_hints` (optional): only for tokens marked uncertain (names, rare words). Keep it sparse.

---

If you want, I can also give you:

* acceptance thresholds (starter values) for align/agreement gates,
* a normalization spec (Unicode, punctuation, numeral expansion policy),
* and a “gold set” sampling plan to measure real WER across all 12 languages before you spend millions of API calls.

[1]: https://commonvoice.mozilla.org/en/guidelines?tab=spontaneous-speech&utm_source=chatgpt.com "Contribution Guidelines - Common Voice - Mozilla"
[2]: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/content-generation-parameters?utm_source=chatgpt.com "Content generation parameters | Generative AI on Vertex AI"
[3]: https://docs.pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html?utm_source=chatgpt.com "CTC forced alignment API tutorial"
[4]: https://arxiv.org/pdf/2305.13516?utm_source=chatgpt.com "Scaling speech technology to 1000+ languages"
[5]: https://montreal-forced-aligner.readthedocs.io/en/v3.2.3/user_guide/workflows/index.html?utm_source=chatgpt.com "Workflows available - Montreal Forced Aligner - Read the Docs"
[6]: https://ai.google.dev/gemini-api/docs/structured-output?utm_source=chatgpt.com "Structured outputs | Gemini API - Google AI for Developers"
[7]: https://ai.google.dev/gemini-api/docs/gemini-3?utm_source=chatgpt.com "Gemini 3 Developer Guide | Gemini API"
[8]: https://www.lrec-conf.org/proceedings/lrec2016/pdf/13_Paper.pdf?utm_source=chatgpt.com "Speech Synthesis of Code-Mixed Text - LREC"
[9]: https://isip.piconepress.com/projects/switchboard/doc/transcription_guidelines/transcription_guidelines.pdf?utm_source=chatgpt.com "Rules and Guidelines for Transcription and Segmentation of ..."
[10]: https://arxiv.org/abs/1912.10211?utm_source=chatgpt.com "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition"
[11]: https://arxiv.org/html/2505.17589v1?utm_source=chatgpt.com "CosyVoice 3: Towards In-the-wild Speech Generation via ..."
[12]: https://arxiv.org/html/2507.21138v1?utm_source=chatgpt.com "TTS-1 Technical Report"
