Good. This is exactly the point where being annoyingly systematic saves you from setting money on fire with 80 million clips.

I went through every concern and turned it into decisions and TODOs. Here’s the plan I’d lock **before** implementation.

## My top-line calls

1. **Your authoritative transcript should be code-mixed in native scripts, not transliterated into one script.**
   Reason: this call is the expensive multimodal one, so it should preserve the highest-fidelity truth from audio. Transliteration is a lossy text transformation you can do later with cheaper text-only tooling. AI4Bharat IndicXlit and the Indic NLP Library already give you programmatic transliteration/romanization options later, so there is no reason to burn fidelity at the audio step. ([GitHub][1])

2. **Use the video language from Supabase only as a weak prior, never as a hard constraint.**
   The prompt should say “expected language hint,” not “target language.” If the segment is fully English inside a Telugu video, Gemini should output English and `detected_language="en"`.

3. **Keep punctuation, but do not normalize numbers/symbols into digits/special symbols in the authoritative field.**
   Use the spoken form. “2026” destroys pronunciation evidence that matters for TTS. “twenty twenty-six” or “two zero two six” preserves what was actually said.

4. **Do not add a model self-confidence field.**
   Keep `detected_language`; drop generic self-confidence. Model self-reports are noisy. External validators are better.

5. **Split the downstream dataset early into lanes.**
   One monolithic “good for ASR and TTS” bucket is how people ruin both.

   * `asr_core`: high-confidence transcript, can tolerate some mid-sentence starts
   * `tts_clean`: strict boundaries, no overlap, no tags/noise
   * `tts_expressive`: strict boundaries but allows approved event tags like laugh/cough

6. **Mainline inference should be standard API calls with inline audio bytes extracted from each tar, not per-segment file uploads.**
   Gemini supports inline audio for requests under 20 MB, which your 2 to 15 second FLAC clips should easily satisfy. The Files API is for larger/reused files, but using it per segment at 80M scale would be absurd in the special way only distributed systems can be. ([Google AI for Developers][2])

7. **Batch is a secondary lane, not the primary lane for a 4-day finish.**
   Gemini Batch is cheaper but explicitly asynchronous with a target turnaround of 24 hours; Vertex batch is also async and Cloud Storage centric. That makes batch useful for backlog/overflow, not as your only path when you have a 100-hour finish line. ([Google AI for Developers][3])

---

## Every concern you raised, converted into decisions and TODOs

### 1) Prompt structure and model settings

**Decision:** Keep one **multilingual core prompt** plus a tiny per-request language hint from Supabase.
Do **not** maintain 12 wildly different prompts unless later testing proves one language needs a special rule.

**TODO**

* Replace `TARGET: Telugu (te-IN)` with `EXPECTED_LANGUAGE_HINT: <lang> (<code>)`.
* Phrase it as: “Use this only as a weak prior. Trust the audio over the hint.”
* Move the actual JSON schema enforcement into `response_json_schema` / `responseSchema`.
* Remove the full schema block from the prompt text.

**Why:** Google explicitly recommends using the response schema in the API and warns that duplicating the schema inside the prompt can lower quality. Complex schemas also count toward input tokens and can trigger `400` errors if they get too elaborate. Gemini 3 Flash Preview supports structured output, system instructions, audio input, caching, thinking, and batch. ([Google Cloud Documentation][4])
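To make the split concrete, here is a sketch of a request builder with the schema in `generationConfig` instead of the prompt text. Field names follow the Gemini REST API shape (`responseMimeType`, `responseSchema`), and the schema body itself is a placeholder to tighten against the final spec; verify both against current docs before relying on them.

```python
import base64

# Placeholder per-segment output schema; tighten per the final spec v1.
SEGMENT_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "transcription": {"type": "STRING"},
        "tagged": {"type": "STRING"},
        "detected_language": {"type": "STRING"},
        "speaker_emotion": {"type": "STRING"},
        "speaker_style": {"type": "STRING"},
        "speaker_pace": {"type": "STRING"},
        "speaker_accent": {"type": "STRING"},
    },
    "required": ["transcription", "tagged", "detected_language"],
}

def build_request(prompt_text: str, flac_bytes: bytes, lang_hint: str) -> dict:
    """One inline-audio request; the prompt stays lean because the schema
    lives in generationConfig, not in the prompt text."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"text": f"{prompt_text}\nEXPECTED_LANGUAGE_HINT: {lang_hint}"},
                {"inlineData": {
                    "mimeType": "audio/flac",
                    "data": base64.b64encode(flac_bytes).decode("ascii"),
                }},
            ],
        }],
        "generationConfig": {
            "temperature": 0,
            "candidateCount": 1,
            "responseMimeType": "application/json",
            "responseSchema": SEGMENT_SCHEMA,
        },
    }
```

Every token the schema does not occupy in the prompt is a token of TPM headroom back.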

**Model settings I’d start with**

* `model = gemini-3-flash-preview`
* `temperature = 0`
* `candidateCount = 1`
* `thinking_level = "low"` on Gemini 3 Flash
* structured JSON output via schema
* fixed `seed` only if you run on Vertex, and only as a best-effort repeatability helper

**Why:** Gemini 3 defaults to high thinking. For high-throughput instruction following, Google recommends lower thinking levels; `minimal` is even cheaper/faster but I would benchmark it only after a canary because transcription is still a multimodal fidelity task. Vertex also exposes a `seed` parameter, but determinism is still best-effort, not guaranteed. ([Google AI for Developers][5])

---

### 2) `transcription` as the source of truth

**Decision:** `transcription` should be the **authoritative, code-mixed, native-script transcript**.

**TODO**

* Keep each language in its own script.
* If a Telugu segment contains an English phrase, keep English in Latin script.
* If the whole segment is English despite Telugu metadata, output English and set `detected_language="en"`.
* Preserve fillers, repetitions, false starts, and abrupt cutoffs.
* Add only prosody-based punctuation: comma, period, question mark, exclamation point.

**Important refinement**

* **Numbers:** keep them in spoken words, not digits.
* **Symbols:** only output the spoken lexical form, not normalized symbols.

  * If the speaker says “percent,” write the word.
  * If the speaker literally spells digits, write what they said.

**Why:** This preserves audio-faithful pronunciation for ASR/TTS. Later normalization is easy. Reconstructing spoken form from digits is where people discover they accidentally trained a model to say “two thousand twenty-six” when the speaker actually said “twenty twenty-six.”

---

### 3) Language hinting and `detected_language`

**Decision:** Keep `detected_language`, but standardize it.

**TODO**

* Change `detected_language` to a **controlled set of ISO 639-1-style codes**, effectively a small enum:

  * `hi mr te ta kn ml gu pa bn as or en`
  * plus `no_speech`
  * plus `other`
* Downstream, compare `expected_language_hint` vs `detected_language`.
* Store a simple `lang_mismatch_flag` outside the model response.

**Why:** Free-form language names become messy fast: `Odia` vs `Oriya`, `Punjabi` vs `Panjabi`, etc. Standardized codes make filtering and QA sane.
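A minimal normalizer for this; the alias table is an assumption seeded with the obvious offenders and should grow from whatever the canary actually emits:

```python
# Controlled code set: the 12 languages plus the two sentinel values.
ALLOWED = {"hi", "mr", "te", "ta", "kn", "ml", "gu", "pa", "bn", "as", "or",
           "en", "no_speech", "other"}

# Free-form names the model might emit despite instructions (assumed list).
ALIASES = {
    "hindi": "hi", "marathi": "mr", "telugu": "te", "tamil": "ta",
    "kannada": "kn", "malayalam": "ml", "gujarati": "gu",
    "punjabi": "pa", "panjabi": "pa", "bengali": "bn", "assamese": "as",
    "odia": "or", "oriya": "or", "english": "en",
}

def normalize_language(raw: str) -> str:
    """Collapse any model output into the controlled code set."""
    code = ALIASES.get(raw.strip().lower(), raw.strip().lower())
    return code if code in ALLOWED else "other"

def lang_mismatch(expected_hint: str, detected: str) -> bool:
    """Computed downstream, stored outside the model response."""
    det = normalize_language(detected)
    return det != "no_speech" and det != normalize_language(expected_hint)
```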

**Decision on confidence:**
Do **not** ask Gemini for a generic confidence field.
Use validator scores later instead.

---

### 4) Code-mixed vs transliterated single script

**Decision:** **Code-mixed native script wins.** Hard stop.

**Why**

* It is the closest thing to an audio-grounded truth.
* Transliteration is a second interpretive task and can scramble borrowings, names, brand words, and accent-driven pronunciations.
* You already have later text-only conversion options with IndicXlit / Indic NLP. ([GitHub][1])

If you want consistency later, create:

* `transcription_authoritative`
* `transcription_romanized` (derived later)
* `transcription_normalized` (derived later)

Do **not** make the expensive audio call do all three jobs.

---

### 5) `tagged` field and event tags

**Decision:** `tagged` must be a **pure derivation** of `transcription`, not a second freeform transcription pass.

**TODO**

* Keep `tagged == transcription` when no event is present.
* Only insert tags at positions where events are clearly audible.
* Do not ask Gemini to “re-listen” or “backtrack” in the prompt. That tends to increase paraphrase risk instead of helping.

**Event set I’d use**

* `[laugh]`
* `[cough]`
* `[sigh]`
* `[breath]`
* `[clears_throat]`
* `[singing]`
* `[music]`
* `[applause]`
* `[noise]`

That is a strong enough set for podcasts without turning the model into a creative Foley artist.

**Rule**

* Only if **prominent and unambiguous**
* Never stack a zoo of tags
* No guessing from context

**Why:** You want precision, not recall, on tags. False positive tags are poison for both ASR labeling and controllable TTS.
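A cheap derivation check along these lines can run in Pass A: strip the approved tags from `tagged` and require that what remains matches `transcription`. The regex and whitespace normalization here are assumptions, not a spec; note the lowercase tag pattern deliberately leaves uppercase sentinels like `[NO_SPEECH]` untouched.

```python
import re

# The nine approved event tags from the list above.
ALLOWED_TAGS = {"laugh", "cough", "sigh", "breath", "clears_throat",
                "singing", "music", "applause", "noise"}
TAG_RE = re.compile(r"\[([a-z_]+)\]")

def tagged_is_valid(transcription: str, tagged: str) -> bool:
    """True iff `tagged` is `transcription` plus approved tags and nothing else."""
    found = TAG_RE.findall(tagged)
    if any(t not in ALLOWED_TAGS for t in found):
        return False                      # model invented a tag
    stripped = TAG_RE.sub("", tagged)
    norm = lambda s: " ".join(s.split())  # ignore spacing the tags occupied
    return norm(stripped) == norm(transcription)
```

Any segment failing this check gets flagged rather than trusted, which is exactly the precision-over-recall posture you want.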

---

### 6) Speaker metadata: emotion, style, pace, accent

**Decision:** Keep:

* `emotion`
* `speaking_style`
* `pace`

Keep `accent` **optional in practice**, even if the field exists.

**TODO**

* Require empty string for accent unless confidence is high.
* Restrict to **broad regional/dialect labels**, not city-level or geo-guess cosplay.
* Do not add a geo-location field.

**Why:** Emotion/style/pace are acoustically observable enough to be useful. Accent is useful only when the model is really sure; otherwise it will happily invent a geography minor.

Gemini’s audio understanding docs explicitly position the model as capable of transcription, speaker detection/diarization-style tasks, and emotion detection. ([Google AI for Developers][2])

---

### 7) Segments are not perfectly single-speaker

**Decision:** Treat brief contamination and acknowledgements as **quality flags**, not as reasons to trust the segment blindly.

**TODO**

* If a second speaker briefly leaks in:

  * keep it for `asr_core` **only if** the main speaker still dominates and later validators pass
  * reject it from `tts_clean`
  * optionally keep for `tts_expressive` only if it is clearly not overlap but an intentional non-verbal event
* Store `overlap_suspected=true` outside the model response when diarization or validator heuristics say so.

**Why:** TTS hates overlap. ASR can survive some of it.

---

### 8) Boundary cleanup and sentence polishing before Gemini

This is the big one.

**Decision:** Use a **boundary-salvage trimmer** before inference. Do not feed Gemini raw clipped edges whenever you can avoid it.

**TODO**

1. Process by **video tar**, not by random segment.

   * lease one `videoID.tar`
   * download once
   * extract once
   * process all segments inside
2. For each segment, compute:

   * leading speech-at-boundary flag
   * trailing speech-at-boundary flag
   * short-window VAD/energy profile
3. If the segment already has clean silence at both sides, keep it.
4. If speech starts immediately at `t=0`, search the first ~1.0 to 1.5s for the earliest usable pause / low-energy trough / silence-to-speech transition and trim forward.
5. If speech ends at the last frame, search the last ~1.0 to 1.5s for the latest usable pause / speech-to-silence transition and trim backward.
6. After trim, add **~150 to 200 ms of digital silence padding** on both ends for Gemini input.
7. Store:

   * original offsets
   * trimmed offsets
   * pad applied
   * `truncated_start`
   * `truncated_end`

**My opinion on the silence amount:**
100 ms is a little too aggressive. I’d target **150 to 200 ms** of pad. It is still cheap in audio-token terms and gives cleaner edge context.
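Steps 2 to 6 can be sketched roughly as below, operating on raw mono float samples. A production version would run on decoded FLAC frames with a real VAD; the 20 ms frame size, energy threshold, and 175 ms pad here are placeholder assumptions.

```python
def frame_energy(samples, sr, frame_ms=20):
    """Mean-square energy per fixed-size frame of a mono float signal."""
    n = max(1, int(sr * frame_ms / 1000))
    return [sum(x * x for x in samples[i:i + n]) / n
            for i in range(0, len(samples), n)]

def salvage_trim(samples, sr, search_s=1.25, pad_ms=175, thresh=1e-4, frame_ms=20):
    """Trim clipped edges to the nearest usable pause, then pad both ends
    with digital silence. Returns (padded_samples, metadata)."""
    energies = frame_energy(samples, sr, frame_ms)
    k = max(1, int(search_s * 1000 / frame_ms))      # frames to search inward
    start_f, end_f = 0, len(energies)
    truncated_start = truncated_end = False

    if energies and energies[0] > thresh:            # speech hard against t=0
        quiet = [i for i, e in enumerate(energies[:k]) if e <= thresh]
        if quiet:
            start_f = quiet[0]                       # earliest usable pause
        else:
            truncated_start = True                   # mid-sentence: flag, don't trim
    if energies and energies[-1] > thresh:           # speech at the final frame
        off = max(0, len(energies) - k)
        quiet = [i for i, e in enumerate(energies[off:]) if e <= thresh]
        if quiet:
            end_f = off + quiet[-1] + 1              # latest usable pause
        else:
            truncated_end = True

    n = max(1, int(sr * frame_ms / 1000))
    pad = [0.0] * int(sr * pad_ms / 1000)
    meta = {
        "trim_start_ms": start_f * frame_ms,
        "trim_end_ms": end_f * frame_ms,
        "pad_ms": pad_ms,
        "truncated_start": truncated_start,
        "truncated_end": truncated_end,
    }
    return pad + samples[start_f * n:end_f * n] + pad, meta
```

When no trough exists in the search window, the edge is left alone and only flagged, so the lane policy (not the trimmer) decides the segment's fate.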

**Hard rule**

* If you cannot salvage a clean-ish start or end and the clip is clearly mid-sentence on both sides, **drop it from TTS**.
* It can still remain ASR-eligible if the transcript is strong.

This is why I want two lanes. One cleanup policy cannot serve both ASR and TTS equally well.

Gemini can accept FLAC and WEBM audio, and audio is tokenized by duration at 32 tokens/second, so adding a tiny pad is basically free. ([Google Cloud Documentation][6])

---

### 9) Segment length policy

**Decision:** Your instincts are right, but I’d make them lane-specific.

**TODO**

* **Global hard reject:** `< 2.0s`
* **ASR lane:** `2.0s to 15.0s`
* **TTS clean lane:** `2.5s to 12.0s`
* **Preferred split target:** cut around `8 to 10s`
* **Stretch limit:** up to `15s` only if the next good boundary is close

**Why**

* Sub-2s clips are disproportionately acknowledgements, clipped fragments, or junk.
* TTS likes slightly longer, more self-contained utterances.
* ASR can benefit from shorter and partial segments.
* With Gemini audio billed by duration at 32 tokens/sec, keeping the median clip near 8 to 10s is also good for quota headroom. ([Google AI for Developers][2])
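The lane policy above reduces to a small assignment function; the boundary and overlap flag names here mirror what the trimmer and validators would emit and are assumptions:

```python
def assign_lanes(duration_s, truncated_start, truncated_end,
                 overlap_suspected=False, has_event_tags=False):
    """Return the lanes a segment qualifies for; a segment can sit in ASR
    and one TTS lane at the same time."""
    lanes = []
    if duration_s < 2.0:
        return lanes                          # global hard reject
    if duration_s <= 15.0:
        lanes.append("asr_core")              # tolerates mid-sentence starts
    clean_edges = not (truncated_start or truncated_end)
    if 2.5 <= duration_s <= 12.0 and clean_edges and not overlap_suspected:
        lanes.append("tts_expressive" if has_event_tags else "tts_clean")
    return lanes
```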

---

### 10) Prompt changes I would make immediately

Here’s the prompt surgery, not the final rewritten prompt yet.

**TODO**

* Change `TARGET:` to `EXPECTED_LANGUAGE_HINT:`
* Remove the giant embedded JSON schema from the prompt
* Add explicit no-speech behavior:

  * `transcription = "[NO_SPEECH]"`
  * `tagged = "[NO_SPEECH]"`
  * `detected_language = "no_speech"`
  * speaker fields = neutral defaults: emotion `neutral`, style `conversational`, pace `normal`, accent `""`
* Add explicit rule:

  * `tagged` must copy `transcription` exactly and only insert allowed tags
* Add explicit number/symbol rule:

  * keep spoken form, do not normalize to digits/symbols unless literally spoken that way
* Add explicit script rule:

  * use the script of the language being spoken, not the language hint
* Remove ambiguous examples that could bias output
* Keep the prompt lean enough to avoid becoming your TPM bottleneck

**Why:**
Google’s structured output docs are pretty blunt here: schema belongs in the API parameter, not repeated in the prompt, and overly complex schemas/prompts can hurt output quality or trigger errors. ([Google Cloud Documentation][4])

---

### 11) Validation strategy without getting stuck on validation

You explicitly said not to get trapped here, which is correct. So I’d do it in **two passes**.

#### Pass A: cheap validation on everything

Run immediately after Gemini output.

**TODO**

* JSON validity
* non-empty transcript
* `% special tokens` (`[UNK]`, `[INAUDIBLE]`, `[NO_SPEECH]`)
* script/language plausibility check
* duration vs text-length plausibility
* boundary score from pre-trimmer
* event-tag count sanity
* `detected_language` vs metadata mismatch
* overlap suspicion flag
* prompt/version/model metadata persisted

This gives you a **provisional quality score** and lets the pipeline keep moving.
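One way to sketch the provisional score: start at 1.0 and subtract per failed check. Every weight and threshold here is a placeholder to tune on the canary, not a recommendation.

```python
def provisional_score(rec: dict) -> float:
    """Cheap Pass A score from fields already on the record. Assumed keys:
    transcription, duration_s, lang_mismatch_flag, overlap_suspected,
    num_event_tags."""
    text = rec.get("transcription", "")
    if not text:
        return 0.0
    score = 1.0
    words = text.split()
    special = sum(w in ("[UNK]", "[INAUDIBLE]", "[NO_SPEECH]") for w in words)
    score -= 0.5 * (special / max(1, len(words)))   # special-token density
    if rec.get("lang_mismatch_flag"):
        score -= 0.2
    if rec.get("overlap_suspected"):
        score -= 0.2
    # duration vs text-length plausibility (chars per second, placeholder band)
    cps = len(text) / max(0.1, rec.get("duration_s", 0.1))
    if not (1.0 <= cps <= 35.0):
        score -= 0.2
    if rec.get("num_event_tags", 0) > 3:            # tag-count sanity
        score -= 0.1
    return max(0.0, round(score, 3))
```

The point is monotonic ordering, not calibration: it only has to rank the tail that Pass B will actually inspect.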

#### Pass B: heavy validation on low-score tail

Run later on the bottom slice, not on all 80M on day one.

**What I’d use**

1. **Language ID check:**
   Bhashini lists audio language detection models for exactly your 12-language set: Assamese, Bengali, English, Hindi, Kannada, Gujarati, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu. ([Bhashini][7])

2. **Romanization for script-agnostic text comparison:**
   Use IndicXlit or Indic NLP Library to convert Indic scripts to Roman form before CER/WER-style agreement checks. ([GitHub][1])

3. **Acoustic alignment / forced alignment:**

   * **NeMo Forced Aligner** is a real option because it works with CTC or hybrid CTC models and supports 14+ languages plus custom models. ([NVIDIA Docs][8])
   * **AI4Bharat IndicConformer** covers all 22 official Indian languages, so it is the most relevant open ASR backbone for your validation stack. ([Hugging Face][9])
   * **Torchaudio multilingual forced alignment** exists, but the relevant APIs are being deprecated, so I would not make that the foundation of a new production validator. ([PyTorch Documentation][10])
   * **MFA** can tokenize/transcribe many Indic languages, but its pretrained acoustic/G2P coverage is not where I’d want it for your main 12-language validator. I would treat MFA as optional tooling, not the backbone. ([Montreal Forced Aligner][11])

4. **Optional external validator on the tail only:**
   Google Cloud Speech-to-Text supports language-specific recognition and can return word-level confidence, so it’s a viable paid audit path for selected low-score subsets, not the whole corpus. ([Google Cloud][12])

5. **LLM review only on the bottom tail:**
   Gemini 3 Pro or 3.1 Pro can review low-score segments, but only after cheap filters. Running a premium reviewer over all 80M would be peak human behavior.

---

### 12) What metrics I would actually store

**TODO**
Store per segment:

* `video_id`
* `segment_id`
* `speaker_id`
* `original_start_ms`, `original_end_ms`
* `trimmed_start_ms`, `trimmed_end_ms`
* `leading_pad_ms`, `trailing_pad_ms`
* `expected_language_hint`
* `detected_language`
* `lang_mismatch_flag`
* `transcription`
* `tagged`
* `speaker_emotion`, `speaker_style`, `speaker_pace`, `speaker_accent`
* `num_unk`
* `num_inaudible`
* `num_event_tags`
* `boundary_score`
* `text_length_per_sec`
* `overlap_suspected`
* `quality_score_provisional`
* `quality_score_final`
* `prompt_version`
* `schema_version`
* `trimmer_version`
* `validator_version`
* `model_id`
* `temperature`
* `thinking_level`
* provider (`aistudio`, `vertex`, `openrouter`)

Version everything. Otherwise three weeks from now you’ll have a giant parquet graveyard and no clue which transcripts came from which prompt.
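The version block is worth pinning as one immutable object stamped onto every record, so no segment ships without its provenance. The default values below are illustrative, not prescriptive:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class PipelineVersions:
    """Every moving part pinned per record, so cohorts can be filtered later."""
    prompt_version: str = "v1"
    schema_version: str = "v1"
    trimmer_version: str = "v1"
    validator_version: str = "v1"
    model_id: str = "gemini-3-flash-preview"
    thinking_level: str = "low"
    temperature: float = 0.0
    provider: str = "aistudio"    # aistudio | vertex | openrouter

def stamp(record: dict, versions: PipelineVersions) -> dict:
    """Attach the version block to a segment record before persisting it."""
    return {**record, **dataclasses.asdict(versions)}
```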

---

## Throughput and 4-day feasibility

Using **your stated quota** of `20K RPM` and `20M TPM`, the target is **arithmetically doable** for 80M segments in 100 hours, but only if you keep the average input size under control.

The critical math is this:

* 20K RPM across 100 hours gives you capacity for **120M requests**
* 20M TPM across 100 hours gives you **120B input tokens**
* Gemini counts audio at **32 tokens per second**, so a 10s clip contributes **320 audio tokens** before prompt text even enters the picture. ([Google AI for Developers][2])

That means:

* if your total average input stays under about **1,500 tokens/request**, 80M requests still fits
* if you keep it under about **1,000 tokens/request**, RPM becomes the likely bottleneck instead of TPM, which is where you want to be
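The arithmetic above, runnable, so it can be re-checked with real canary numbers instead of estimates:

```python
RPM, TPM = 20_000, 20_000_000
HOURS, SEGMENTS = 100, 80_000_000
AUDIO_TOK_PER_SEC = 32

minutes = HOURS * 60
request_capacity = RPM * minutes                    # 120M requests
token_capacity = TPM * minutes                      # 120B input tokens
tokens_per_req_budget = token_capacity / SEGMENTS   # 1,500 tokens/request

def bottleneck(avg_audio_s: float, prompt_tokens: int) -> str:
    """Which quota binds first at a given average request size."""
    per_req = avg_audio_s * AUDIO_TOK_PER_SEC + prompt_tokens
    rpm_supported_by_tpm = TPM / per_req
    return "RPM" if rpm_supported_by_tpm >= RPM else "TPM"
```

For example, a 10s clip with a 500-token prompt is RPM-bound (820 tokens/request), while the same clip with a 2,000-token prompt flips to TPM-bound, which is precisely why the prompt compression in Phase 2 matters.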

So yes, your current plan can finish in 4 days, **but only if**:

1. the prompt gets compressed
2. you do not waste tokens on duplicated schema
3. you keep average durations in the 8 to 10s zone
4. you run very high concurrency with sane backpressure

---

## Provider strategy

### Primary lane: Gemini standard API / online inference

**Use this first.**

Because your segments live inside `videoID.tar`, the practical worker shape is:

* lease tar
* download tar once
* extract segments locally
* trim locally
* send inline audio bytes to Gemini

Gemini supports inline audio for requests under 20 MB, which fits your segment sizes. ([Google AI for Developers][2])
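The extract-once step of that worker shape can be sketched with the standard library; the `.flac` naming convention inside the tar and the exact inline cap are assumptions to confirm against your layout and current docs.

```python
import io
import tarfile

INLINE_LIMIT = 20 * 1024 * 1024   # assumed ~20 MB inline-request cap

def iter_tar_segments(tar_bytes: bytes):
    """Yield (name, flac_bytes) for each segment in one downloaded
    videoID.tar, skipping anything too large for an inline request."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".flac")):
                continue
            data = tar.extractfile(member).read()
            if len(data) < INLINE_LIMIT:   # leave headroom for prompt text
                yield member.name, data
```

Each yielded byte string goes through the trimmer, then straight into an inline-audio request, so the tar is downloaded and opened exactly once per video.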

### Secondary lane: Vertex batch

**Use as overflow/backlog lane.**

Vertex batch wants Cloud Storage/BigQuery-oriented inputs and is async. If you want batch at scale, the clean path is to mirror the **trimmed** segments you actually intend to process into GCS and reference them there. Vertex's online inference can also use public HTTP URLs or GCS URIs for small audio files, but batch planning is much cleaner with GCS. ([Google Cloud][13])

### Tertiary lane: OpenRouter

**Overflow only.**

OpenRouter supports:

* structured outputs
* BYOK/provider routing
* paid usage without platform-level rate limits on paid plans

But for audio it requires **base64-encoded audio**; direct URLs are not supported for audio input. So it is a decent escape hatch once you already have extracted bytes, not a magical zero-work overflow path. ([OpenRouter][14])

---

## The plan I would actually execute

### Phase 0: lock decisions

* authoritative field = code-mixed native script
* no transliteration in Gemini call
* no generic confidence field
* `detected_language` standardized
* strict `tagged` derivation
* `tts_clean` / `tts_expressive` / `asr_core` split

### Phase 1: canary

Run a **10k segment stratified canary** across:

* all 12 languages
* 2s to 15s bins
* clean vs clipped boundaries
* low/high overlap suspicion

Test:

* current prompt vs trimmed prompt
* `thinking_level=low` vs `minimal`
* event-tag precision
* language mismatch behavior

### Phase 2: lock prompt + schema + trimmer

After canary:

* freeze prompt v1
* freeze schema v1
* freeze boundary-salvage rules v1

### Phase 3: ramp production

* processing unit = `videoID.tar`
* async online Gemini calls with inline audio
* store provisional scores immediately
* reject obvious junk early
* keep low-score tail for later review

### Phase 4: overflow and audit

* use Vertex batch only for backlog chunks or when online throttles
* use OpenRouter only if both Google lanes are constrained
* run heavy validators on the bottom scoring tail, not the full firehose

---

## Final recommendations in one blunt list

* **Use code-mixed native script as the authoritative transcript.**
* **Keep numbers as spoken words, not digits.**
* **Use Supabase language as a weak prior only.**
* **Keep `detected_language`; don’t add a generic confidence field.**
* **Make `tagged` a pure copy-plus-tags derivation.**
* **Keep accent optional/empty unless very obvious.**
* **Trim boundaries aggressively before Gemini.**
* **Add ~150 to 200 ms silence padding after trimming.**
* **Split ASR and TTS acceptance criteria.**
* **Process per tar, not per segment fetch.**
* **Use inline audio for the standard path.**
* **Use batch as overflow, not the main lane.**
* **Version every component and store per-segment metrics.**
* **Do not let validation block the first full pass.**

The next thing to lock is the **prompt spec v1** and the **boundary-salvage spec v1**. After that, implementation becomes mechanical instead of philosophical, which is rare and beautiful.

[1]: https://github.com/AI4Bharat/IndicXlit "https://github.com/AI4Bharat/IndicXlit"
[2]: https://ai.google.dev/gemini-api/docs/audio "https://ai.google.dev/gemini-api/docs/audio"
[3]: https://ai.google.dev/gemini-api/docs/batch-api "https://ai.google.dev/gemini-api/docs/batch-api"
[4]: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output "https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output"
[5]: https://ai.google.dev/gemini-api/docs/thinking "https://ai.google.dev/gemini-api/docs/thinking"
[6]: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash "https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash"
[7]: https://dibd-bhashini.gitbook.io/bhashini-apis/available-models-for-usage "https://dibd-bhashini.gitbook.io/bhashini-apis/available-models-for-usage"
[8]: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tools/nemo_forced_aligner.html "https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tools/nemo_forced_aligner.html"
[9]: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual "https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual"
[10]: https://docs.pytorch.org/audio/stable/tutorials/forced_alignment_for_multilingual_data_tutorial.html "https://docs.pytorch.org/audio/stable/tutorials/forced_alignment_for_multilingual_data_tutorial.html"
[11]: https://montreal-forced-aligner.readthedocs.io/en/v3.2.0/user_guide/workflows/train_acoustic_model.html "https://montreal-forced-aligner.readthedocs.io/en/v3.2.0/user_guide/workflows/train_acoustic_model.html"
[12]: https://cloud.google.com/speech/docs/languages "https://cloud.google.com/speech/docs/languages"
[13]: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/batch-prediction-api "https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/batch-prediction-api"
[14]: https://openrouter.ai/docs/features/structured-outputs "https://openrouter.ai/docs/features/structured-outputs"
