Alright Bharath, I've gone through your entire message and the prompt file thoroughly. Let me address **every single concern** you raised, with reasoning for each decision. I'm organizing this as a numbered plan with clear sections.

---

## MASTER PLAN: Gemini-3-Flash Transcription Pipeline

---

### 1. TRANSCRIPTION FIELD — Code-Mixed vs Single Script

**Decision: Code-mixed (each language in its native script) ✅ — Your current approach is correct.**

**Reasoning:** Gemini's audio encoder maps phonemes → tokens, and its language model is trained with each language in its native script. When you ask it to transliterate Telugu into Roman or English into Telugu script, you're adding an extra transformation step that introduces errors. Code-mixed is the most natural output format for a multimodal model — it writes what it "hears" in the script it associates with that language. This gives you the highest fidelity transcription on the first pass.

For downstream: You mentioned you'll convert scripts later with a separate LLM call anyway. So the correct strategy is:
- **This call (expensive, has audio):** Get maximum accuracy in code-mixed native script
- **Later call (cheap, text-only):** Script normalization / transliteration as needed

**Prompt tweak:** Your prompt already says this correctly. No change needed.

---

### 2. LANGUAGE MISMATCH — Handling Expected vs Actual Language

**Decision: Pass expected language as a *hint*, but instruct Gemini to trust its ears.**

**Reasoning:** If you don't pass the language hint, Gemini occasionally misidentifies closely related languages (Hindi/Marathi, Telugu/Kannada in certain dialects). But if you pass it as a hard constraint, it force-fits transcription into the wrong language when there's a genuine mismatch.

**Strategy:**
- Pass `EXPECTED: Telugu (te-IN)` as a hint in the prompt (you already do this as "TARGET")
- Rename "TARGET" to "EXPECTED LANGUAGE" to make it less prescriptive
- Keep the `detected_language` field as your verification gate
- Add a prompt line: *"The expected language is a hint from metadata. If the audio clearly speaks a different language, trust what you hear and set detected_language accordingly."*

**Downstream use:** When `detected_language ≠ expected_language`, flag this segment. You can batch-review mismatches later. Most will be legitimate (English interview in a Telugu podcast, Hindi guest on a Kannada show).

---

### 3. TEMPERATURE AND MODEL SETTINGS

**Decision: Temperature = 0 ✅**

**Reasoning:** You're right. Despite Google's documentation about reasoning tasks, transcription is a deterministic mapping task. There's one correct transcription for any audio. Temperature = 0 minimizes hallucination and ensures reproducibility. If you reprocess the same segment, you want the same output.

**Additional settings to lock down:**
- `top_p = 1.0` (no nucleus sampling interference)
- `top_k = 1` (greedy decoding)
- `response_mime_type = "application/json"` (enforced JSON mode)
- `response_json_schema` = your schema (structured output)
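In code, the locked-down config looks like this. I'm sketching it as a plain dict; the field names mirror the google-genai SDK's `GenerateContentConfig`, but treat them as assumptions and verify against the SDK version you install:

```python
# Locked-down generation settings from above. Field names are assumed to
# match the google-genai SDK's GenerateContentConfig — verify before use.
TRANSCRIPTION_CONFIG = {
    "temperature": 0.0,        # deterministic decoding
    "top_p": 1.0,              # no nucleus-sampling interference
    "top_k": 1,                # greedy decoding
    "response_mime_type": "application/json",  # enforced JSON mode
}

def build_config(schema: dict) -> dict:
    """Attach the structured-output schema to the base config."""
    cfg = dict(TRANSCRIPTION_CONFIG)
    cfg["response_json_schema"] = schema
    return cfg
```

Keeping the schema out of the base dict lets you reuse one config across per-language schema variants if you ever need them.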

---

### 4. PUNCTUATION STRATEGY

**Decision: Prosody-based punctuation, always on ✅**

**Reasoning:** You're correct — you can always strip punctuation for training, but you can't reliably add it back later without another expensive model call. Prosody-based punctuation (from audible pauses/intonation) is the right approach since Gemini can detect this from the audio signal.

**Your prompt already handles this correctly.** The rule "Insert from audible pauses/intonation only. No pause = no punctuation" is clean.

**One addition to the prompt:**
- Add: *"For numbers, write them as spoken — use digits for large numbers (2024, 500), words for small contextual numbers (ek, రెండు). Match the speaker's intent."*

This is important for TTS training later — you want numbers in the form they'll be synthesized.

---

### 5. AUDIO EVENT TAGS — Fixed List

**Decision: Curated list of 8 tags for podcast audio.**

Your current list: `[laugh] [cough] [sigh] [breath] [singing] [noise] [music] [applause]`

**My recommendation — modify to 9 tags:**

| Keep | Remove | Add |
|------|--------|-----|
| `[laugh]` | `[singing]` (rare in podcasts, and if present it's usually background music) | `[throat_clear]` (very common in podcasts) |
| `[cough]` | | `[sniff]` (common, useful for TTS naturalness) |
| `[sigh]` | | |
| `[breath]` | | |
| `[noise]` | | |
| `[music]` | | |
| `[applause]` | | |

**Final list (9):** `[laugh] [cough] [sigh] [breath] [noise] [music] [applause] [throat_clear] [sniff]`

**Reasoning:** In YouTube podcasts, throat clears and sniffs are extremely common and are useful signals for both ASR (to not hallucinate text) and TTS (for naturalness). Singing is rare in talk podcasts and overlaps confusingly with music — if someone's singing, it's either background (→ `[music]`) or not really a transcription target.

**Prompt tweak:** Add to event tags section: *"Only insert tags when the event is clearly and prominently audible. When uncertain, omit. Do NOT tag normal speech breathing — only tag [breath] for audible, notable breaths."*

This prevents Gemini from over-tagging every natural breath between words.
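A cheap post-hoc check can enforce both the 9-tag allowlist and the "character-identical" rule for the `tagged` field. Minimal sketch; the whitespace normalization around removed tags is an assumption about how tags are spaced in the output:

```python
import re

# The 9-tag allowlist decided above. Any other bracketed token in the
# "tagged" field is a violation worth flagging.
ALLOWED_TAGS = {
    "[laugh]", "[cough]", "[sigh]", "[breath]", "[noise]",
    "[music]", "[applause]", "[throat_clear]", "[sniff]",
}

TAG_RE = re.compile(r"\[[a-z_]+\]")

def check_tagged(transcription: str, tagged: str) -> dict:
    """Verify that 'tagged' uses only allowed tags and that stripping
    them recovers 'transcription' character-for-character."""
    found = TAG_RE.findall(tagged)
    unknown = [t for t in found if t not in ALLOWED_TAGS]
    # Remove tags, then collapse the double spaces they leave behind.
    stripped = TAG_RE.sub("", tagged)
    stripped = re.sub(r"\s{2,}", " ", stripped).strip()
    norm_ref = re.sub(r"\s{2,}", " ", transcription).strip()
    return {"unknown_tags": unknown,
            "matches_transcription": stripped == norm_ref}
```

Segments that fail either check can go straight into the flagged queue alongside the validation scores in section 9.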

---

### 6. SPEAKER METADATA — Emotion, Style, Pace, Accent

**Decision on accent: Make it a required field, but allow empty string.**

**Reasoning:** 
- `emotion`, `speaking_style`, `pace` — these are well-defined enums Gemini can handle confidently. Keep as required. ✅
- `accent` — Gemini can sometimes detect regional accents (Hyderabadi Telugu vs Coastal Andhra, Mumbai Hindi vs UP Hindi), but it's unreliable for fine-grained distinctions. Making it required with empty string as default means you always get the field without forcing Gemini to hallucinate an accent it's not sure about.

**Your schema already handles this correctly** — accent is a string field (not enum), allowing empty string. Keep it.

**One concern:** The `speaking_style` enum has "excited" which overlaps with the `emotion` enum's "excited". Consider changing `speaking_style` "excited" to "energetic" to reduce ambiguity:

```
"speaking_style": ["conversational", "narrative", "energetic", "calm", "emphatic", "sarcastic", "formal"]
```

---

### 7. AUDIO PREPROCESSING — Segment Boundary Cleaning

This is critical. Here's the full strategy:

**Step 1: Detect boundary quality**
- Compute RMS energy in the first 50ms and last 50ms of each segment
- If RMS < threshold (e.g., -40dB) → boundary is clean (starts/ends in silence)
- If RMS ≥ threshold → boundary is dirty (starts/ends mid-speech)

**Step 2: For dirty starts (segment begins mid-speech)**
- Scan forward from the start looking for the first low-energy valley (below -35dB for ≥ 80ms)
- This indicates a pause between words/sentences
- Trim everything before this point
- If no valley found within the first 40% of the segment, **discard the entire segment** — it's likely all continuous speech from a bad cut

**Step 3: For dirty ends (segment ends mid-speech)**
- Scan backward from the end looking for the last low-energy valley
- Trim everything after this point
- Same 40% discard rule

**Step 4: Silence padding**
- Add **150ms** of silence (zero-padded) to both ends
- **Why 150ms and not 100ms:** Gemini's audio encoder uses windowed frames (~25ms with 10ms hop). 150ms gives ~15 frames of clear silence context, which is enough for the model to confidently identify speech onset. 100ms is borderline — 150ms is safer with negligible cost increase.

**Step 5: Post-trim length check**
- If remaining audio < 2s after trimming → discard
- Expected loss: 20-35% of segments (your 30-40% estimate is in the right ballpark)

**Implementation:** Use `librosa` or `pydub` for energy computation. This is fast — pure signal processing, no ML inference needed. Can process thousands of segments per second.
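A minimal numpy sketch of Steps 1-2 (the -40dB/-35dB thresholds and the 40% discard rule are from the plan above; the 10ms scan frame and function names are my choices, not a fixed spec):

```python
import numpy as np

def rms_db(x: np.ndarray) -> float:
    """RMS energy in dBFS for a float audio slice in [-1, 1]."""
    rms = np.sqrt(np.mean(np.square(x)) + 1e-12)
    return 20.0 * np.log10(rms + 1e-12)

def find_valley(audio, sr, threshold_db=-35.0, min_len_ms=80, frame_ms=10):
    """Sample index just past the first low-energy valley lasting
    >= min_len_ms below threshold_db, or None if none exists."""
    frame = int(sr * frame_ms / 1000)
    need = max(1, min_len_ms // frame_ms)   # consecutive quiet frames
    quiet_run = 0
    for start in range(0, len(audio) - frame, frame):
        if rms_db(audio[start:start + frame]) < threshold_db:
            quiet_run += 1
            if quiet_run >= need:
                return start + frame        # end of qualifying valley
        else:
            quiet_run = 0
    return None

def clean_start(audio, sr, boundary_ms=50, clean_db=-40.0):
    """Steps 1-2: trim a dirty start to the first valley; None = discard."""
    head = audio[: int(sr * boundary_ms / 1000)]
    if rms_db(head) < clean_db:
        return audio                        # already starts in silence
    limit = int(len(audio) * 0.4)           # search only the first 40%
    cut = find_valley(audio[:limit], sr)
    return audio[cut:] if cut is not None else None
```

The dirty-end case (Step 3) is the mirror image: run the same valley scan on the reversed signal. Step 4's padding is just `np.concatenate` with `np.zeros(int(0.150 * sr))` on both sides.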

---

### 8. SEGMENT LENGTH MANAGEMENT

**Decision:**
- **Minimum:** 2s (discard below)
- **Sweet spot:** 5–10s (best transcription quality for Gemini)
- **Hard max:** 15s
- **Preferred cut search window:** 7–12s

**Strategy for long segments (>10s):**
1. Scan for silence/low-energy points between 7s and 12s
2. If found → cut at the longest silence within that window
3. If not found → extend search to 12–15s
4. If still not found by 15s → force cut at the lowest energy point in 10–15s range
5. Each sub-segment gets the 150ms silence padding treatment

**Why 5–10s is the sweet spot:** Gemini's audio understanding is best when it has enough context to understand prosody (>3s) but not so much that the JSON output becomes very long and the model starts losing precision (>15s). 10s at ~25 tokens/second = 250 audio tokens — well within the model's attention span.
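The long-segment strategy above can be sketched as follows (the -35dB pause threshold is reused from the preprocessing section; the 50ms scan frame and function name are mine):

```python
import numpy as np

def find_cut_point(audio, sr, lo=7.0, hi=12.0, hard=15.0, frame_ms=50):
    """Pick a cut sample for a long segment: quietest frame in the
    preferred 7-12s window, widening to 12-15s, else forced (Steps 1-4)."""
    frame = int(sr * frame_ms / 1000)

    def quietest(t0, t1):
        a, b = int(t0 * sr), min(int(t1 * sr), len(audio) - frame)
        if a >= b:
            return None, None
        starts = list(range(a, b, frame))
        energies = [float(np.sqrt(np.mean(audio[s:s + frame] ** 2)))
                    for s in starts]
        i = int(np.argmin(energies))
        return starts[i], energies[i]

    silence_rms = 10 ** (-35 / 20)          # linear RMS for ~-35 dBFS
    for t0, t1 in ((lo, hi), (hi, hard)):   # Steps 1-3
        idx, e = quietest(t0, t1)
        if idx is not None and e < silence_rms:
            return idx
    # Step 4: no real pause found — force cut at lowest energy in 10-15s
    idx, _ = quietest(10.0, hard)
    return idx
```

Each sub-segment then goes back through the 150ms padding step, as noted above.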

---

### 9. TRANSCRIPT VALIDATION STRATEGY

This is the hardest part. Here's my honest assessment of each approach:

**a) MFA (Montreal Forced Aligner) for Indic languages:**
- MFA has acoustic models for Hindi and a few others, but coverage for all 12 of your languages is poor. Telugu, Kannada, Malayalam, Assamese, Odia — no pretrained MFA models.
- Training MFA models requires lexicons you don't have.
- **Verdict: Not viable as a primary validation method for all 12 languages.**

**b) Cross-validation with another ASR model:**
- Use AI4Bharat's IndicWhisper or IndicConformer models as a second opinion
- These exist for most of your 12 languages
- Compute CER (Character Error Rate) between Gemini output and IndicWhisper output
- Low CER = both models agree = high confidence
- High CER = disagreement = flag for review
- **Verdict: Best practical approach. Fast inference, covers most languages.**

**c) Gemini Pro review:**
- Send the audio + Gemini Flash transcript to Gemini Pro and ask it to score accuracy
- Expensive and slow — defeats the purpose of using Flash
- **Verdict: Only for spot-checking a random sample (e.g., 0.1% of segments).**

**d) Romanization + G2P validation:**
- Libraries like `aksharamukha` or `indic-transliteration` can convert native scripts to Roman
- You could then use phoneme-level comparison
- But this only validates script conversion, not whether the transcription matches the audio
- **Verdict: Useful as a secondary check for script correctness, not for audio-text alignment.**

**e) Confidence scoring from output characteristics:**
- Flag segments where `detected_language ≠ expected_language`
- Flag segments with high `[UNK]` or `[INAUDIBLE]` density
- Flag segments where transcription is suspiciously short relative to audio duration (possible hallucination/omission)
- Flag segments where transcription is suspiciously long (possible hallucination/addition)
- Compute characters-per-second ratio — each language has a natural speaking rate range
- **Verdict: Free, fast, catches obvious failures. Should be your first-pass filter.**

**Recommended validation stack:**
1. **Tier 1 (free, all segments):** Heuristic scoring — length ratios, UNK density, language match, chars/sec
2. **Tier 2 (cheap, all segments):** Cross-validate with IndicWhisper CER on a per-segment basis
3. **Tier 3 (expensive, flagged segments only):** Gemini Pro re-transcription for segments that failed Tier 1+2
4. **Store all scores in Supabase** alongside segment metadata for later filtering

**Don't block on validation — run it as a parallel pipeline** after transcription completes.
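Tier 1 can be a single pure-Python function over the parsed JSON. The chars/sec bounds below are placeholders, not measured rates; calibrate them per language from your own score distributions once a few thousand segments are in Supabase:

```python
# Tier-1 heuristic filter (approach e). Rate bounds are ASSUMED
# placeholders — replace with per-language calibrated values.
CHARS_PER_SEC = {
    "te": (8.0, 25.0),
    "hi": (8.0, 25.0),
}

def tier1_flags(result: dict, expected_lang: str, duration_s: float) -> list:
    """Return flag names for a segment; empty list = passed Tier 1."""
    flags = []
    text = result.get("transcription", "")
    if result.get("detected_language") != expected_lang:
        flags.append("language_mismatch")
    markers = text.count("[UNK]") + text.count("[INAUDIBLE]")
    words = max(1, len(text.split()))
    if markers / words > 0.2:
        flags.append("high_unk_density")
    lo, hi = CHARS_PER_SEC.get(expected_lang, (5.0, 30.0))
    cps = len(text) / max(duration_s, 0.1)
    if cps < lo:
        flags.append("too_short_for_duration")   # possible omission
    elif cps > hi:
        flags.append("too_long_for_duration")    # possible hallucination
    return flags
```

Store the flag list per segment; Tier 2 and Tier 3 then only need to consume rows where `flags` is non-empty.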

---

### 10. PROMPT REFINEMENTS

After reviewing your `prompt.txt`, here are the specific tweaks I'd make:

**a)** Change "TARGET: Telugu (te-IN)" to be a template variable: `EXPECTED LANGUAGE: {language} ({locale})`

**b)** Add explicit instruction about language hint behavior:
```
The EXPECTED LANGUAGE is a metadata hint. Trust what you HEAR. If the audio 
is clearly a different language, transcribe in that language and set 
detected_language accordingly. The hint helps with ambiguous cases only.
```

**c)** Tighten the tagged field instruction:
```
"tagged" must be CHARACTER-IDENTICAL to "transcription" except for inserted 
event tags. Do not re-listen or re-interpret. Copy transcription verbatim, 
then insert tags at the positions where events occur.
```

**d)** Add number handling rule (as mentioned in #4).

**e)** Add to audio events section:
```
Do NOT tag normal inter-word breathing. Only tag [breath] for audible, 
notable breaths or gasps. When uncertain about any event, omit the tag.
```

**f)** Your current prompt has a typo: "AUTHORITATIV E" → "AUTHORITATIVE"

**g)** Make the prompt a template with `{language}`, `{locale}`, `{script_name}` variables so you can generate per-language prompts programmatically from one template.
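Point (g) is a small amount of code. Sketch below with stand-in template text; your real `prompt.txt` body goes in place of mine, and the language table would cover all 12 languages:

```python
# Per-language prompt templating (point g). The template body here is a
# STAND-IN, not the actual prompt.txt; only the variable names
# {language}, {locale}, {script_name} are from the plan.
PROMPT_TEMPLATE = """\
EXPECTED LANGUAGE: {language} ({locale})
SCRIPT: {script_name}

The EXPECTED LANGUAGE is a metadata hint. Trust what you HEAR. If the audio
is clearly a different language, transcribe in that language and set
detected_language accordingly. The hint helps with ambiguous cases only."""

LANGUAGES = {
    "te": {"language": "Telugu", "locale": "te-IN", "script_name": "Telugu"},
    "hi": {"language": "Hindi", "locale": "hi-IN", "script_name": "Devanagari"},
}

def build_prompt(lang_code: str) -> str:
    return PROMPT_TEMPLATE.format(**LANGUAGES[lang_code])
```

One template plus one table means a wording fix propagates to all 12 languages in a single edit.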

---

### 11. SCALE, RATE LIMITS & EXECUTION PLAN

Let's do the math properly.

**Token budget per request:**
- Audio tokens: ~25 tokens/sec × 8s avg = ~200 tokens
- System prompt: ~700 tokens (your prompt is long)
- User prompt: ~30 tokens
- Output JSON: ~200-250 tokens
- **Total: ~1,150 tokens per request**

**AI Studio limits:**
- 20K RPM, 20M TPM
- Effective RPM from TPM: 20M / 1,150 ≈ **17,400 RPM** (TPM is the bottleneck)
- Per day: 17,400 × 60 × 24 ≈ **25M requests/day**
- 4 days: **~100M requests** — covers your 80M with headroom

**But** — sustained 17K RPM for 96 hours straight requires:
- Zero downtime
- Perfect parallelism
- No rate limit backoff
- Realistically expect 60-70% utilization → **60-70M requests in 4 days from AI Studio alone**
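A quick sanity check on the arithmetic above (using 225 output tokens as the midpoint of the 200-250 estimate, and 65% as the utilization midpoint):

```python
# Token budget and throughput check for the AI Studio channel.
TOKENS_PER_REQUEST = 200 + 700 + 30 + 225   # audio + system + user + output
TPM_LIMIT = 20_000_000
RPM_LIMIT = 20_000

# TPM is the binding constraint: ~17.3K effective RPM, not 20K.
effective_rpm = min(RPM_LIMIT, TPM_LIMIT // TOKENS_PER_REQUEST)
per_day = effective_rpm * 60 * 24                 # ~25M requests/day
four_days_at_65pct = int(per_day * 4 * 0.65)      # ~65M requests
```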

**Vertex AI Batch:**
- 10B inflight tokens → 10B / 1,150 ≈ **8.7M concurrent requests** in a batch
- Batch jobs run asynchronously — submit and wait
- Perfect for overflow and for languages where you have large volumes
- Submit batches for the highest-volume languages (Hindi, Telugu, Tamil)
- **Throughput: depends on Google's backend, but 10-20M/day is realistic for batch**

**OpenRouter (paid fallback):**
- Use ONLY if AI Studio + Vertex are both saturated
- Route to `google/gemini-2.5-flash` (or whatever the flash model is available as)
- Set a daily spend cap to control costs

**Execution strategy:**

| Priority | Channel | Target | Segments |
|----------|---------|--------|----------|
| 1 | AI Studio (real-time) | Sustain 15K RPM | ~60M in 4 days |
| 2 | Vertex Batch | Batch jobs for large languages | ~15-20M |
| 3 | OpenRouter | Overflow only | Remaining if any |

**Concurrency architecture:**
- Use `asyncio` + `aiohttp` with semaphore limiting
- 500-1000 concurrent connections to AI Studio
- Exponential backoff on 429s
- Per-language queues so you can prioritize/pause by language
- Progress tracking in Supabase (segment_id → status, transcript, scores)
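Skeleton of that concurrency layer: semaphore-limited workers plus exponential backoff with jitter. `transcribe_one` and `RateLimited` are stand-ins for the real AI Studio call and its 429 response; per-language queues and Supabase writes would wrap around this:

```python
import asyncio
import random

class RateLimited(Exception):
    """Stand-in for an HTTP 429 from the API."""

async def with_backoff(coro_fn, *args, max_retries=6, base=1.0):
    """Retry a coroutine on 429s with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return await coro_fn(*args)
        except RateLimited:
            # 1s, 2s, 4s, ... scaled by +/- 25% jitter to avoid thundering herd
            delay = base * (2 ** attempt) * random.uniform(0.75, 1.25)
            await asyncio.sleep(delay)
    raise RuntimeError("retries exhausted")

async def run_queue(segments, transcribe_one, concurrency=500):
    """Process segments with at most `concurrency` in-flight requests."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(seg):
        async with sem:
            return await with_backoff(transcribe_one, seg)

    # gather preserves input order, so results line up with segments
    return await asyncio.gather(*(worker(s) for s in segments))
```

Running one `run_queue` per language gives you the pause/prioritize control with no extra machinery: cancel or drain that language's task group independently.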

---

### 12. DATA FLOW ARCHITECTURE

```
R2 (videoID.tar)
  → Untar: metadata.json + segments/*.flac
  → Supabase lookup: get language_id for videoID
  → Audio Preprocessing:
      1. RMS boundary check
      2. Trim dirty boundaries
      3. Re-segment if >10s (find cut points 7-12s)
      4. Discard if <2s
      5. Pad 150ms silence
      6. Export as FLAC (keep FLAC, Gemini supports it)
  → Gemini Transcription:
      1. Build prompt from template (inject language)
      2. Encode audio as base64 or inline_data
      3. Send to AI Studio / Vertex Batch / OpenRouter
      4. Parse JSON response
      5. Store transcript + metadata in Supabase
  → Validation (parallel/async):
      1. Heuristic scoring (chars/sec, UNK density, language match)
      2. IndicWhisper cross-validation CER
      3. Store scores in Supabase
  → Post-processing (later phase):
      1. Filter by quality scores
      2. Re-process low-scoring segments with Gemini Pro
      3. Script normalization LLM call
```

---

### 13. COMPLETE TODO CHECKLIST

Here's every concern mapped to a task:

| # | Concern | Decision | Status |
|---|---------|----------|--------|
| 1 | Code-mixed vs single script | Code-mixed native script ✅ | Decided |
| 2 | Language mismatch handling | Pass as hint, trust ears, detected_language as gate | Decided |
| 3 | Temperature setting | 0, with top_k=1, top_p=1 | Decided |
| 4 | Punctuation strategy | Prosody-based, always on, add number handling | Decided |
| 5 | Audio event tag list | 9 tags (replaced singing with throat_clear, added sniff) | Decided |
| 6 | Tagged field derivation | Character-identical copy + tags only | Decided |
| 7 | Speaker metadata fields | Keep all, change "excited" → "energetic" in style enum | Decided |
| 8 | Accent field required? | Required but allow empty string (current approach) | Decided |
| 9 | Audio boundary cleaning | RMS detection + trim to nearest valley + 150ms padding | Decided |
| 10 | Segment length limits | Min 2s, sweet 5-10s, max 15s, cut search 7-12s | Decided |
| 11 | Transcript validation - MFA | Not viable for all 12 languages | Decided |
| 12 | Transcript validation - cross-ASR | IndicWhisper CER as Tier 2 | Decided |
| 13 | Transcript validation - heuristic | Chars/sec, UNK density, language match as Tier 1 | Decided |
| 14 | Transcript validation - Gemini Pro | Spot-check only (0.1% sample) | Decided |
| 15 | Romanization validation | Aksharamukha for script check, secondary only | Decided |
| 16 | Prompt template-ization | Make language/locale/script template variables | To implement |
| 17 | Prompt fixes | Typo fix, language hint rewording, number rule | To implement |
| 18 | Rate limit math | AI Studio ~60M, Vertex ~20M, OpenRouter overflow | Decided |
| 19 | 4-day timeline feasibility | Feasible with AI Studio + Vertex parallel | Decided |
| 20 | Async architecture | asyncio + semaphore + per-language queues | To implement |
| 21 | R2 → preprocessing pipeline | Untar, RMS check, trim, re-segment, pad, export | To implement |
| 22 | Supabase schema for results | Segment transcripts + scores + metadata | To implement |
| 23 | Batch vs real-time routing | Large languages to Vertex batch, rest to AI Studio | To implement |
| 24 | OpenRouter integration | Paid fallback with spend cap | To implement |
| 25 | Score storage for later filtering | All validation scores in Supabase | To implement |

---

### 14. OPEN QUESTIONS FOR YOU

Before we move to implementation, I want your input on a few things:

1. **IndicWhisper for validation** — Do you already have IndicWhisper/IndicConformer models deployed, or should I plan GPU compute for running these? They need GPU inference.

2. **Supabase schema** — Do you want me to design the full schema for storing transcripts + scores, or do you already have a segments table?

3. **R2 access pattern** — Are you pulling tars from R2 to a local/cloud machine for preprocessing, or do you want to stream segments directly? This affects the preprocessing architecture.

4. **Vertex AI Batch** — Do you already have a GCP project with Vertex AI enabled and the batch prediction API set up?

5. **Priority order of languages** — Should we process all 12 languages in parallel, or do you want to prioritize certain languages (e.g., Hindi and Telugu first)?

6. **The `speaking_style` "excited" → "energetic" rename** — Are you fine with this, or do you want to keep it as-is?

---

That's every single concern addressed. Let me know which points you want to debate or iterate on, and then we can move to implementation.