# Gemini Transcription Pipeline — Master Plan

> **Status**: PLAN PHASE — No implementation until validated  
> **Target**: ~80M segments across 12 languages in 4 days (~100 hours)  
> **Model**: gemini-3-flash (temp=0)

---

## Table of Contents

1. [Concern Tracker — Every Request Addressed](#1-concern-tracker)
2. [Prompt Design Decisions](#2-prompt-design)
3. [JSON Schema — Final Recommended](#3-json-schema)
4. [Audio Preprocessing Pipeline](#4-audio-preprocessing)
5. [Transcription Execution Strategy](#5-execution-strategy)
6. [Validation & Scoring](#6-validation)
7. [Throughput Math & Rate Limit Strategy](#7-throughput)
8. [Architecture Overview](#8-architecture)
9. [Open Questions / Risks](#9-risks)

---

## 1. Concern Tracker — Every Request Addressed {#1-concern-tracker}

| # | Concern | Section | Decision |
|---|---------|---------|----------|
| 1 | Review prompt.txt | §2 | **FILE NOT UPLOADED** — need re-upload. Plan proceeds on described behavior. |
| 2 | Temperature = 0 for Gemini | §2.1 | ✅ Confirmed. Correct for deterministic transcription. |
| 3 | Code-mixed vs single-script transliterated | §2.2 | **Code-mixed** (native script + Latin for English words). See reasoning. |
| 4 | Language hint strategy (pass tag or not?) | §2.3 | Pass as soft hint + `detected_language` as correction. |
| 5 | `detected_language` field behavior | §2.3 | Model returns what it actually hears; prompt instructs override behavior. |
| 6 | Punctuation always on | §2.4 | ✅ Yes. Punctuation, numbers, symbols included. Strippable later. |
| 7 | `tagged` field — audio event tags | §2.5 | Fixed list of 10 events. Insert only when confident. |
| 8 | Audio event list for podcast audio | §2.5 | Curated list provided with reasoning. |
| 9 | Re-listen/backtrack for tagged field | §2.5 | Prompt instructs: transcribe first → then annotate. |
| 10 | Speaker metadata — emotion | §2.6 | Required field. 6-class enum. |
| 11 | Speaker metadata — speaking_style | §2.6 | Required field. 7-class enum. |
| 12 | Speaker metadata — pace | §2.6 | Required field. 3-class enum. |
| 13 | Speaker metadata — accent/regional | §2.6 | **Optional** (empty string if not confident). See reasoning. |
| 14 | Audio preprocessing — silence padding | §4.1 | 150ms recommended on both ends. |
| 15 | Audio preprocessing — finding cut points | §4.2 | Energy-valley detection with VAD. |
| 16 | Audio preprocessing — abrupt starts/ends | §4.2 | Trim to first/last energy valley; accept 30-40% loss. |
| 17 | Segment length — min 2s, preferred 10s, max 15s | §4.3 | Enforced with tiered logic. |
| 18 | Segment length — finding cut point in 7-15s range | §4.3 | Sliding window energy-valley search. |
| 19 | Segments < 2s — remove/ignore | §4.3 | Discarded. Logged for stats. |
| 20 | Transcript validation strategy | §6.1 | Multi-signal scoring. |
| 21 | MFA aligners for Indic languages | §6.2 | Limited support. Alternative approach proposed. |
| 22 | Programmatic romanization | §6.3 | `aksharamukha` / `indic-transliteration` libraries. |
| 23 | CTC/G2P models for Indic validation | §6.4 | AI4Bharat IndicWav2Vec + Whisper cross-check. |
| 24 | Scoring mechanism & re-run strategy | §6.5 | Score → store in Supabase → re-run low scores. |
| 25 | Scale: ~80M segments | §7.1 | Feasibility confirmed with caveats. |
| 26 | Rate limits: AI Studio 20K RPM / 20M TPM | §7.2 | Primary channel. ~67-95 hrs at 70-100% utilization. |
| 27 | Rate limits: Vertex 10B tokens inflight batch | §7.3 | Secondary channel. Async batch processing. |
| 28 | OpenRouter as last resort | §7.4 | Overflow only. Paid. |
| 29 | 4-day / 100-hour timeline | §7.5 | Tight but feasible with parallel AI Studio + Vertex Batch. |
| 30 | Data format: videoID.tar → metadata.json + segments/*.flac | §8 | Pipeline reads tar, extracts, processes, uploads results. |
| 31 | R2 as source storage | §8 | Stream from R2, write results back to R2 + Supabase. |
| 32 | Supabase metadata with language tags | §8 | Query language_id per videoID before prompting. |
| 33 | 12 languages supported | §2.3 | All handled via language hint + detected_language. |

---

## 2. Prompt Design Decisions {#2-prompt-design}

### 2.1 Temperature = 0 ✅

**Decision**: Use `temperature: 0`.

**Reasoning**: Google's caution about temp=0 applies to reasoning, math, and creative tasks where sampling diversity helps the model explore solution spaces. For transcription, we want the single most likely token sequence — determinism is exactly right. This is effectively a perception task, not a generation task. The model should report what it hears, not "explore" alternatives.

**Additional settings**:
- `top_p: 1.0` (no nucleus sampling restriction)
- `top_k: 1` (greedy decoding, reinforces determinism)
- `max_output_tokens: 2048` (sufficient for JSON response of a 15s segment)
- `response_mime_type: "application/json"` (enforce JSON output in Gemini)
- `response_schema: <your_schema>` (Gemini's structured output mode — use this, it dramatically improves schema compliance)
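
These settings map onto a request body as sketched below, using the REST-style `generationConfig` field names. This is illustrative only — `build_request` and the `inlineData` layout are assumptions to be checked against whichever SDK is actually used:

```python
# Deterministic-transcription settings as a Gemini REST-style generationConfig
# payload (field names per the public REST API; verify against your SDK).
GENERATION_CONFIG = {
    "temperature": 0,          # single most likely token sequence
    "topP": 1.0,               # no nucleus-sampling restriction
    "topK": 1,                 # greedy decoding
    "maxOutputTokens": 2048,   # ample for a 15s segment's JSON
    "responseMimeType": "application/json",
    # "responseSchema": RESPONSE_SCHEMA,  # the schema from §3
}

def build_request(audio_b64: str, prompt: str) -> dict:
    """Assemble one generateContent-style request body (sketch)."""
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inlineData": {"mimeType": "audio/flac", "data": audio_b64}},
            ]
        }],
        "generationConfig": GENERATION_CONFIG,
    }
```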

### 2.2 Code-Mixed vs Single-Script Transliterated

**Decision**: **Code-mixed** — native script for the primary language, Latin script for English/foreign words.

**Reasoning**:

1. **Accuracy of THIS call is paramount** (your words). Gemini is most accurate when it can naturally represent what it hears. Asking it to transliterate English words like "machine learning" into Telugu script ("మెషిన్ లెర్నింగ్") introduces a transliteration step where errors compound. The model might mishear AND mistransliterate.

2. **Indian language speakers naturally code-mix**. In podcasts especially, English technical terms, brand names, and filler words appear constantly. Code-mixed representation is the ground truth of how these languages are spoken.

3. **You're doing a second LLM call anyway** for script conversion. That call is text-only (cheap, fast, reliable). Let the expensive audio-LLM call focus purely on accurate perception, not script gymnastics.

4. **For TTS training**, you'll want the text to match what the speaker actually said. If someone says "machine learning" in English mid-Telugu-sentence, that's what the TTS should learn to produce. Code-mixed text preserves this.

**Prompt instruction**: "Transcribe in the native script of the spoken language. When the speaker uses English or other foreign words mid-sentence, write those words in Latin script. Do not transliterate foreign words into the native script."

**Example output**: `"అతను machine learning గురించి చెప్పాడు"` (NOT `"అతను మెషిన్ లెర్నింగ్ గురించి చెప్పాడు"`)

### 2.3 Language Hint Strategy

**Decision**: Pass language as a **soft hint** in the prompt, and use `detected_language` as the model's correction mechanism.

**Reasoning**: 
- **Without any hint**: Gemini sometimes misidentifies closely related languages (Hindi/Marathi, Telugu/Kannada script confusion). Having a hint anchors the model.
- **With a hard constraint**: If you say "this is Telugu" but the speaker is actually speaking Hindi in this segment (diarization error or guest speaker), the model will force-fit Hindi words into Telugu script = garbage.
- **Soft hint is the sweet spot**: "The expected language is Telugu, but transcribe what you actually hear."

**Prompt instruction**: 
```
Expected language: {language_tag}
This is a hint from metadata. The speaker may use a different language or mix languages. 
Transcribe exactly what you hear. Report the actual detected language in detected_language field.
If the detected language differs significantly from the expected language, still transcribe accurately — do not force the expected language.
```

**`detected_language` field**: BCP-47 tag (e.g., `hi`, `te`, `en`, `te-en` for code-mixed Telugu-English). This allows downstream filtering of segments where metadata was wrong.

### 2.4 Punctuation — Always On ✅

**Decision**: Full punctuation, numbers as digits, standard symbols.

**Reasoning**: You can always strip punctuation for training (trivial regex). You cannot recover punctuation later without another expensive model call or manual annotation. Including punctuation also helps the model produce more coherent transcriptions since punctuation is part of its language model.

**Rules for the prompt**:
- Use native punctuation marks (। for Hindi/Devanagari, etc.)
- Numbers as spoken: "तीन सौ" or "300" — prefer digits when the speaker clearly says a number
- Question marks, exclamation points, commas as appropriate
- No sentence-final period for sentences that are clearly cut off mid-speech (addresses your abrupt-cut concern)
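
The strip-for-training step really is a small regex; a minimal sketch, where the character class (ASCII punctuation plus the Devanagari danda/double danda) is a starting point to extend per language:

```python
import re

# Punctuation to strip for training text: ASCII marks plus the Devanagari
# danda (।) and double danda (॥). Extend the class per language as needed.
_PUNCT = re.compile(r"[।॥.,!?;:\"'()\[\]{}…-]+")

def strip_punctuation(text: str) -> str:
    """Remove punctuation and collapse whitespace."""
    return re.sub(r"\s+", " ", _PUNCT.sub(" ", text)).strip()
```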

### 2.5 Audio Event Tags

**Decision**: Fixed list of 10 tags. Insert only when the model is confident. Two-pass mental model in the prompt.

**Curated tag list for YouTube podcast audio**:

| Tag | Description | Why included |
|-----|-------------|--------------|
| `[laugh]` | Laughter (speaker) | Extremely common in podcasts |
| `[chuckle]` | Brief/soft laugh | Distinct from full laugh; common in conversational podcasts |
| `[cough]` | Coughing | Common involuntary sound |
| `[sigh]` | Audible sigh | Emotional marker, useful for TTS expressiveness |
| `[breath]` | Audible inhale/exhale | Common in close-mic podcast recording |
| `[sniff]` | Sniffing | Occasional but distinct |
| `[throat_clear]` | Throat clearing | Very common in podcasts, especially at segment starts |
| `[music]` | Background music / jingle | Intros, outros, transitions |
| `[noise]` | Non-speech background noise | Catch-all for ambient sounds |
| `[overlap]` | Another speaker's voice bleeding in | Addresses your diarization imperfection concern |

**Tags I considered but excluded**:
- `[clap]` — Rare in 1-on-1 podcasts. If needed, `[noise]` covers it.
- `[silence]` — Better handled by the audio preprocessing stage. Including it in tags conflates audio structure with events.
- `[click]` — Too granular; often just microphone artifacts.

**Prompt instruction for tagged field**:
```
After completing the transcription, review the audio again to identify non-speech audio events.
Insert event tags from ONLY this list: [laugh], [chuckle], [cough], [sigh], [breath], [sniff], [throat_clear], [music], [noise], [overlap].
Place tags at the exact position in the transcript where they occur.
Only insert a tag if you are confident the event is clearly audible. When in doubt, omit it.
The tagged field must contain the EXACT same transcribed text as the transcription field, with only event tags added.
```
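
The invariant that `tagged` equals `transcription` plus tags is cheap to verify at parse time; a sketch, assuming tags appear exactly as `[tag]`:

```python
import re

# The fixed tag vocabulary from the table above.
ALLOWED_TAGS = {"laugh", "chuckle", "cough", "sigh", "breath",
                "sniff", "throat_clear", "music", "noise", "overlap"}
_TAG = re.compile(r"\[([a-z_]+)\]")

def tagged_is_consistent(transcription: str, tagged: str) -> bool:
    """True iff `tagged` equals `transcription` once event tags are removed
    and every tag comes from the allowed list."""
    tags = _TAG.findall(tagged)
    if any(t not in ALLOWED_TAGS for t in tags):
        return False
    stripped = re.sub(r"\s+", " ", _TAG.sub(" ", tagged)).strip()
    normalized = re.sub(r"\s+", " ", transcription).strip()
    return stripped == normalized
```

A response failing this check can be retried without any model-quality judgment involved.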

### 2.6 Speaker Metadata

**emotion** (required): `["neutral", "happy", "sad", "angry", "excited", "surprised"]`
- ✅ Good enum. These are the 6 that Gemini can reliably distinguish from audio.

**speaking_style** (required): `["conversational", "narrative", "excited", "calm", "emphatic", "sarcastic", "formal"]`
- ✅ Good enum. Note that "excited" appears in both emotion and speaking_style; you could rename the style value to something like "animated" to remove the overlap. For now the overlap is acceptable — emotion=excited + style=conversational is valid and distinct from emotion=neutral + style=excited.

**pace** (required): `["slow", "normal", "fast"]`
- ✅ Simple and reliable. Gemini can distinguish these from audio.

**accent** (optional — empty string if uncertain):

**Decision**: Keep as **optional** (empty string default). Do NOT make it required.

**Reasoning**: 
- Accent detection from short (2-10s) segments is unreliable, even for humans.
- Forcing the model to always produce an accent will lead to hallucinated accent labels (e.g., always defaulting to "Standard Hindi" or the most common regional accent).
- When the model IS confident (strong regional markers, 10+ seconds of clear speech), it's valuable data. When it's not, empty string is correct.
- For Indic languages, the accent space is huge (dozens of regional varieties per language). Free-text is better than an enum here.

**Prompt instruction**: "If you can confidently identify a regional accent or dialect (e.g., 'Hyderabadi Telugu', 'Bhojpuri Hindi', 'Chennai Tamil'), include it. If uncertain, return an empty string. Do not guess."

---

## 3. JSON Schema — Final Recommended {#3-json-schema}

```json
{
  "type": "object",
  "properties": {
    "transcription": {
      "type": "string",
      "description": "Code-mixed transcription in native script. English/foreign words in Latin script. Full punctuation. No audio event tags."
    },
    "tagged": {
      "type": "string",
      "description": "Same as transcription but with audio event tags inserted at occurrence positions. Tags: [laugh], [chuckle], [cough], [sigh], [breath], [sniff], [throat_clear], [music], [noise], [overlap]"
    },
    "speaker": {
      "type": "object",
      "properties": {
        "emotion": {
          "type": "string",
          "enum": ["neutral", "happy", "sad", "angry", "excited", "surprised"]
        },
        "speaking_style": {
          "type": "string",
          "enum": ["conversational", "narrative", "excited", "calm", "emphatic", "sarcastic", "formal"]
        },
        "pace": {
          "type": "string",
          "enum": ["slow", "normal", "fast"]
        },
        "accent": {
          "type": "string",
          "description": "Regional accent/dialect if confidently detected, otherwise empty string"
        }
      },
      "required": ["emotion", "speaking_style", "pace", "accent"],
      "additionalProperties": false
    },
    "detected_language": {
      "type": "string",
      "description": "BCP-47 language tag of the language actually spoken (e.g., 'hi', 'te', 'en', 'te-en' for code-mixed)"
    }
  },
  "required": ["transcription", "tagged", "speaker", "detected_language"],
  "additionalProperties": false
}
```

**Change from your original**: `accent` is now `required` in the schema but with empty-string semantics. This is because Gemini's structured output mode works better when all fields are required — the model always produces them. The prompt instructs it to use empty string when uncertain. This avoids the model sometimes including and sometimes omitting the field, which causes JSON parsing issues.

---

## 4. Audio Preprocessing Pipeline {#4-audio-preprocessing}

### 4.1 Silence Padding

**Decision**: **150ms** of silence on both ends.

**Reasoning**:
- 100ms is the bare minimum for a human ear to register silence. It works but feels tight.
- 150ms gives a comfortable buffer that mimics natural speech pauses without wasting time.
- 200ms+ starts to feel like an unnatural gap and wastes audio budget.
- This silence should be true digital silence (zeros), not low-level noise, so the model clearly perceives the sentence boundary.

**Implementation**: After trimming to clean boundaries, prepend and append 150ms of zeros at the segment's sample rate.
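
The padding step in one helper — pure Python for clarity; the real pipeline would do the same on numpy arrays:

```python
def pad_with_silence(samples, sample_rate, pad_ms=150):
    """Prepend and append true digital silence (zeros) to a mono segment.

    `samples` is a list/sequence of float samples at `sample_rate` Hz.
    """
    pad = [0.0] * int(sample_rate * pad_ms / 1000)
    return pad + list(samples) + pad
```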

### 4.2 Finding Clean Cut Points (Abrupt Start/End Handling)

This is the most critical preprocessing step. Here's the strategy:

**Step 1: Compute frame-level energy envelope**
- Use librosa to compute RMS energy in ~20ms frames (hop_length=320 at 16kHz).
- Smooth with a small moving average (50ms window) to avoid micro-fluctuations.

**Step 2: Classify segment boundaries**

For the **start** of the segment:
- Check if the first 50ms has energy below a threshold (e.g., -40dB relative to segment peak).
- If YES → clean start. Keep as-is.
- If NO → abrupt start. The segment begins mid-speech.
  - Scan forward to find the first "energy valley" — a local minimum where energy drops below 20% of the segment's mean energy for at least 80ms.
  - This valley likely corresponds to a pause between words or sentences.
  - Trim everything before this valley. The new start is the beginning of the next sentence/phrase.

For the **end** of the segment:
- Same logic in reverse. Check last 50ms.
- If abrupt → scan backward to find the last energy valley.
- Trim everything after this valley.

**Step 3: Validate remaining segment**
- After trimming, if the remaining audio is < 2s → discard.
- If remaining audio is > 15s → apply the length-based splitting from §4.3.

**Accepting the 30-40% loss**: This is the right tradeoff. 60-70% of high-quality, cleanly-bounded segments is far more valuable for TTS/ASR training than 100% of noisy, abruptly-cut segments. The discarded portions would trigger hallucination in Gemini and inject garbage into your training data.
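
The start-boundary logic above can be sketched against a precomputed RMS envelope. For brevity this sketch approximates the -40dB clean-start check with the same 20%-of-mean threshold used for valleys; the production version would keep them separate:

```python
def find_clean_start(env, frame_ms=20, valley_ratio=0.2, min_valley_ms=80):
    """Return the frame index at which to trim an abrupt start.

    `env` is a smoothed frame-level RMS envelope (one value per `frame_ms`).
    Returns 0 when the start is already clean (first ~50ms is low-energy).
    """
    if not env:
        return 0
    thresh = valley_ratio * (sum(env) / len(env))
    head = max(1, 50 // frame_ms)
    if all(e < thresh for e in env[:head]):
        return 0  # clean start, keep as-is
    # Abrupt start: scan forward for the first sustained energy valley.
    need = max(1, min_valley_ms // frame_ms)
    run = 0
    for i, e in enumerate(env):
        run = run + 1 if e < thresh else 0
        if run >= need:
            return i + 1  # sustained valley confirmed; trim up to here
    return 0  # no valley found; caller decides whether to discard
```

The end-of-segment case is the same function applied to the reversed envelope.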

### 4.3 Segment Length Strategy

```
Segment Length → Action
──────────────────────────────────
< 2s           → DISCARD (log to stats)
2s - 10s       → KEEP as-is (after silence padding)
10s - 15s      → TRY to find a cut point in 7-10s range
                  If found → split into two segments
                  If not found → keep as single segment up to 15s
> 15s          → MUST split. Find energy valleys and split.
                  If no valley found → force-cut at 12s (accept quality loss, flag for review)
```

**Finding cut points for long segments**:
- Compute energy envelope as in §4.2.
- Identify all energy valleys (local minima below 15% of mean energy, sustained for 60ms+).
- Prefer valleys closest to the 8-10s mark (optimal Gemini context length).
- If multiple valleys exist, pick the deepest one in the preferred range.

**Why 10s preferred, 15s max**:
- Gemini Flash processes audio in chunks. Shorter segments = less chance of attention drift or mid-segment hallucination.
- 10s is roughly 2-3 sentences in conversational Indian language speech (typical speaking rate of 120-160 words/min).
- 15s is the absolute upper bound before transcription quality degrades noticeably.
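
The valley-selection rule for long segments, as a sketch over the same frame-level envelope used in §4.2:

```python
def pick_cut_frame(env, frame_ms=20, prefer_s=(8.0, 10.0),
                   valley_ratio=0.15, min_valley_ms=60):
    """Pick a split frame for an over-long segment, or None if no valley.

    Valleys are runs of frames below `valley_ratio` * mean energy lasting at
    least `min_valley_ms`. Prefers the deepest valley centred in the 8-10s
    window, else the valley closest to that window.
    """
    if not env:
        return None
    thresh = valley_ratio * (sum(env) / len(env))
    need = max(1, min_valley_ms // frame_ms)
    valleys, run_start = [], None  # (centre_frame, depth) pairs
    for i, e in enumerate(list(env) + [thresh + 1.0]):  # sentinel flushes last run
        if e < thresh:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= need:
                valleys.append(((run_start + i - 1) // 2, min(env[run_start:i])))
            run_start = None
    if not valleys:
        return None
    lo, hi = (s * 1000 / frame_ms for s in prefer_s)
    in_window = [v for v in valleys if lo <= v[0] <= hi]
    if in_window:
        return min(in_window, key=lambda v: v[1])[0]  # deepest in window
    mid = (lo + hi) / 2
    return min(valleys, key=lambda v: abs(v[0] - mid))[0]  # closest to window
```

A `None` result on a >15s segment triggers the force-cut-at-12s fallback from the table above.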

---

## 5. Transcription Execution Strategy {#5-execution-strategy}

### 5.1 Overall Pipeline Flow

```
R2 (videoID.tar)
    ↓
[1] Download tar → extract metadata.json + segments/*.flac
    ↓
[2] Query Supabase → get language_tag for videoID
    ↓
[3] Audio Preprocessing (§4)
    - Silence detection & boundary cleanup
    - Length-based splitting
    - 150ms silence padding
    - Discard < 2s segments
    ↓
[4] Batch Formation
    - Group segments into batches
    - Each request: 1 audio segment + prompt + language hint
    ↓
[5] Dispatch to Gemini
    - Primary: AI Studio (real-time, 20K RPM)
    - Secondary: Vertex Batch (async, 10B token pool)
    - Overflow: OpenRouter (paid, only if needed)
    ↓
[6] Response Parsing & Validation (§6)
    - Parse JSON, validate schema
    - Compute quality scores
    - Store in Supabase
    ↓
[7] Write results back to R2
    - videoID_transcripts.json (or per-segment JSONs in a tar)
```

### 5.2 Audio Format for Gemini

**Decision**: Send as FLAC (your current format). Gemini natively supports FLAC.

- No format conversion needed.
- FLAC is lossless and smaller than WAV.
- Ensure sample rate is 16kHz mono. If segments are at higher sample rates, downsample to 16kHz before sending. This saves tokens (Gemini charges per second of audio, and higher sample rates don't improve its ASR).
- Gemini audio token cost: ~32 tokens per second of audio at 16kHz.

### 5.3 Prompt Template

```
You are a precise multilingual audio transcription system for Indian languages.

TASK: Transcribe the provided audio segment accurately.

LANGUAGE HINT: {language_tag}
This is metadata from the source. The speaker may use a different language or code-mix. 
Transcribe exactly what you hear. If the actual language differs, report it in detected_language.

TRANSCRIPTION RULES:
1. Write in the native script of the spoken language.
2. When the speaker uses English or other foreign words mid-sentence, write those words in Latin script. Do not transliterate foreign words into native script.
3. Include full punctuation (native punctuation marks: ।, ?, !, etc.), commas, and question marks.
4. Write numbers as digits when clearly spoken as numbers.
5. If the audio is unintelligible or contains only noise, return an empty transcription.
6. If the audio starts or ends mid-word, transcribe only the complete words you can clearly hear.

AUDIO EVENT TAGS (for the tagged field only):
After transcribing, re-examine the audio for non-speech events.
Insert tags from ONLY this fixed list: [laugh], [chuckle], [cough], [sigh], [breath], [sniff], [throat_clear], [music], [noise], [overlap]
- Place tags at the exact position where they occur in the speech.
- Only insert if the event is clearly and confidently audible. Omit if uncertain.
- The tagged field must contain the IDENTICAL transcription text with only these tags inserted.

SPEAKER ANALYSIS:
- emotion: Classify the overall emotional tone of the speaker.
- speaking_style: Classify how the speaker is delivering the content.
- pace: Classify the speaking speed.
- accent: If you can confidently identify a regional accent or dialect (e.g., "Hyderabadi Telugu", "Bhojpuri Hindi", "Chennai Tamil"), include it. Otherwise, return empty string "".

OUTPUT: Respond with valid JSON matching the provided schema. Nothing else.
```

**Note**: This is a starting template. Once you re-upload prompt.txt, I'll merge your crafted prompt with these adjustments.

---

## 6. Validation & Scoring {#6-validation}

### 6.1 Multi-Signal Validation Strategy

No single validation method works perfectly for all 12 Indic languages. Instead, we use **multiple weak signals** and combine them into a composite score.

| Signal | Method | Languages | Speed | Weight |
|--------|--------|-----------|-------|--------|
| Schema compliance | JSON parse + field validation | All | Instant | Pass/Fail gate |
| Empty/too-short check | Character count vs audio duration | All | Instant | Pass/Fail gate |
| Whisper cross-reference | Run Whisper-large-v3 on same audio, compare | All 12 + en | ~0.3x realtime on GPU | 0.35 |
| IndicWav2Vec CTC score | AI4Bharat CTC model log-likelihood | Hi, Bn, Ta, Te, Mr, Gu, Ml, Kn, Or, Pa | ~0.1x realtime on GPU | 0.25 |
| Language ID check | Compare detected_language vs metadata | All | Instant | 0.10 |
| Character/script consistency | Verify text uses expected Unicode blocks | All | Instant | 0.15 |
| Duration-to-text ratio | Words per second sanity check (typical: 2-4 words/s) | All | Instant | 0.15 |

### 6.2 MFA Aligners for Indic Languages

**Status**: Limited but usable.

- **Montreal Forced Aligner (MFA)**: Has acoustic models for Hindi and Bengali. Community models exist for some others but quality varies. NOT practical for all 12 languages at scale.
- **AI4Bharat alignment models**: Better coverage for Indian languages but still not all 12.
- **Recommendation**: MFA is too slow and language-limited for 80M segments. Use it only for spot-checking a sample of flagged segments during quality review, not as a pipeline stage.

### 6.3 Programmatic Romanization

**Available tools**:

1. **`aksharamukha`** (Python): Best coverage for Indian scripts. Supports all 12 of your languages. Can convert any Indic script ↔ Latin (IAST, ISO 15919, etc.).
   ```
   pip install aksharamukha
   ```

2. **`indic-transliteration`** (Python, by sanskrit-coders): Good for Devanagari-family scripts.

3. **AI4Bharat `IndicXlit`**: Neural transliteration model. Higher quality but slower. Overkill for validation purposes.

**Use case**: Convert Gemini's native-script output to romanized form, then compare with Whisper's romanized output. This normalizes script differences and focuses the comparison on phonetic content.
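
A sketch of the post-romanization comparison, using `difflib`'s ratio as a stand-in for normalized edit distance. Both inputs are assumed to be already romanized (e.g. via aksharamukha):

```python
import difflib

def phonetic_similarity(gemini_roman: str, whisper_roman: str) -> float:
    """Similarity in [0, 1] between two romanized transcripts.

    Case and whitespace are normalized first so the comparison focuses on
    phonetic content rather than formatting.
    """
    a = " ".join(gemini_roman.lower().split())
    b = " ".join(whisper_roman.lower().split())
    if not a and not b:
        return 1.0
    return difflib.SequenceMatcher(None, a, b).ratio()
```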

### 6.4 CTC/G2P Models for Indic Validation

**AI4Bharat IndicWav2Vec**: CTC-based ASR models are available for Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Malayalam, Kannada, Odia, and Punjabi. This covers 10 of your 12 languages (Assamese is missing; you can approximate it with the Bengali model).

**How to use for validation**:
1. Run IndicWav2Vec CTC on the audio → get CTC log-probabilities for the predicted transcript.
2. Use Gemini's transcript as the reference. Force-align with CTC and compute the alignment score.
3. Low alignment score = Gemini's transcript doesn't match what the CTC model hears = flag for review.

**Whisper-large-v3**: Supports all your languages. Run it as a second opinion transcriber. Compute normalized edit distance (after romanization) between Gemini and Whisper outputs. High divergence = one of them is wrong = flag.

### 6.5 Scoring & Re-Run Strategy

**Composite Score (0-100)**:

```
score = (
    0.35 * whisper_similarity +      # Normalized edit distance → similarity
    0.25 * ctc_alignment_score +      # IndicWav2Vec forced alignment
    0.15 * script_consistency_score + # Unicode block check
    0.15 * duration_ratio_score +     # Words-per-second sanity
    0.10 * language_match_score       # detected_language vs metadata
)
```

**Storage**: Write to Supabase per-segment:
```
segment_id | video_id | language | score | whisper_sim | ctc_score | 
flagged | transcript_json | created_at
```

**Re-run strategy**:
- Score < 40 → **Discard** (likely garbage audio or wrong speaker).
- Score 40-65 → **Re-run with Gemini Pro** (superior model, worth the cost for uncertain segments).
- Score 65-80 → **Accept with flag** (usable but review in batch).
- Score > 80 → **Accept** (high confidence).
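
The scoring formula and re-run thresholds in runnable form:

```python
def composite_score(whisper_sim, ctc_align, script_ok, duration_ok, lang_match):
    """Weighted composite quality score on a 0-100 scale (weights from §6.1).

    Each input signal is expected in [0, 1].
    """
    return 100 * (0.35 * whisper_sim + 0.25 * ctc_align
                  + 0.15 * script_ok + 0.15 * duration_ok
                  + 0.10 * lang_match)

def route(score):
    """Map a composite score to the re-run action."""
    if score < 40:
        return "discard"         # likely garbage audio or wrong speaker
    if score < 65:
        return "rerun_pro"       # re-transcribe with Gemini Pro
    if score <= 80:
        return "accept_flagged"  # usable, review in batch
    return "accept"              # high confidence
```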

**Phase 1 (days 1-4)**: Transcribe all 80M segments with Gemini Flash. Compute only the instant metrics (schema, length, script, duration ratio, language match). Store all results.

**Phase 2 (days 5-7)**: Run Whisper + IndicWav2Vec on a stratified sample (~1-5% = 800K-4M segments) to calibrate the composite score. Then run on flagged segments only.

**Phase 3**: Re-transcribe low-scoring segments with Gemini Pro or manual review.

> **Your instruction noted**: "Don't be stuck at the validation part, can be tweaked during development phase." — Agreed. Phase 1 focuses purely on transcription throughput. Validation signals are computed lazily/async.

---

## 7. Throughput Math & Rate Limit Strategy {#7-throughput}

### 7.1 Scale Confirmation

- **Input**: ~80M segments after preprocessing
- **Average segment duration**: ~6s (estimated from your VAD + 10s-max splitting)
- **Average audio tokens per segment**: 6s × 32 tokens/s = ~192 tokens
- **Prompt tokens**: ~500 tokens (system prompt + schema)
- **Output tokens**: ~200-350 tokens (JSON response)
- **Total tokens per request**: ~900-1050 tokens (let's use 1000)

### 7.2 AI Studio (Primary Channel)

| Metric | Value |
|--------|-------|
| RPM | 20,000 |
| TPM | 20,000,000 |
| RPD | Unlimited |
| Requests/hour | 1,200,000 |
| Requests/day | 28,800,000 |
| Tokens/request | ~1,000 |
| TPM bottleneck | 20M / 1000 = 20K RPM ← matches RPM limit |

**At 100% utilization**: 80M / 1.2M per hour = **66.7 hours** ✅

**At 70% utilization** (realistic with retries, errors, backpressure): 80M / 840K per hour = **95.2 hours** ⚠️ Tight.

**At 60% utilization**: 80M / 720K per hour = **111 hours** ❌ Exceeds 100-hour target.
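
The utilization scenarios above reduce to one helper (one segment per request):

```python
def hours_to_finish(segments, rpm_limit, utilization):
    """Wall-clock hours to process `segments` at a sustained rate of
    `utilization` * `rpm_limit` requests per minute."""
    return segments / (rpm_limit * utilization * 60)

# The three scenarios: 100% -> ~66.7h, 70% -> ~95.2h, 60% -> ~111.1h.
```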

**Conclusion**: AI Studio alone is risky. We need Vertex Batch running in parallel.

### 7.3 Vertex AI Batch (Secondary Channel)

| Metric | Value |
|--------|-------|
| Inflight token limit | 10,000,000,000 (10B) |
| At 1000 tokens/request | ~10M concurrent requests |
| Typical batch turnaround | 1-24 hours depending on batch size |

**Strategy**: 
- Submit large batches (1M-5M segments each) to Vertex Batch API.
- These process asynchronously while AI Studio handles real-time flow.
- Vertex Batch has lower priority but no RPM limits — it processes as fast as Google's infra allows.
- Submit first batches on Day 1, results back by Day 2-3.

**Recommended split**:
- **Vertex Batch**: 50M segments (submitted in 10 batches of 5M)
- **AI Studio real-time**: 30M segments (streaming at 20K RPM)
- **OpenRouter**: 0 segments (reserve for emergencies)

### 7.4 OpenRouter (Overflow)

- Use only if AI Studio hits unexpected throttling or Vertex Batch has delays.
- Gemini Flash via OpenRouter is available but costs money.
- Keep as emergency reserve. Pre-configure the integration so it can be activated instantly.

### 7.5 Four-Day Execution Plan

```
Day 0 (Prep):
  - Finalize prompt & schema
  - Build audio preprocessing pipeline
  - Build dispatch infrastructure (async workers)
  - Test on 100 segments end-to-end
  
Day 1:
  - Submit Vertex Batch #1-#4 (20M segments)
  - Start AI Studio streaming (target: 7-8M segments processed)
  - Monitor error rates, adjust concurrency
  
Day 2:
  - Submit Vertex Batch #5-#8 (20M segments)
  - Continue AI Studio streaming (cumulative: 15-16M)
  - Vertex Batch #1-#2 results arriving → validate, re-queue failures
  
Day 3:
  - Submit Vertex Batch #9-#10 (10M segments)
  - Continue AI Studio streaming (cumulative: 23-24M)
  - Most Vertex batches completing
  - Start instant-metric validation on completed segments
  
Day 4:
  - AI Studio streaming wraps up remaining (cumulative: 30M)
  - All Vertex batches should be complete (50M)
  - Handle retries for failed segments
  - Run validation metrics, flag low-scoring segments
  - Total: 80M ✓
```

### 7.6 Concurrency Architecture

For AI Studio (20K RPM):
- **333 requests per second** sustained.
- Use **async Python** (aiohttp/httpx) with a semaphore of ~400-500 concurrent requests.
- Exponential backoff on 429s (rate limit) and 500s.
- Use multiple API keys if available to distribute load.
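
A sketch of that dispatch loop — `call_gemini` is a placeholder for the real httpx/aiohttp request, and `RateLimited` stands in for a 429 response:

```python
import asyncio
import random

class RateLimited(Exception):
    """Stand-in for an HTTP 429 from the API."""

async def call_gemini(segment_id):
    """Placeholder for the actual generateContent request."""
    await asyncio.sleep(0)  # simulate I/O
    return {"segment_id": segment_id, "ok": True}

async def dispatch_one(sem, segment_id, max_retries=5):
    """One request under the global semaphore, with exponential backoff."""
    async with sem:
        for attempt in range(max_retries):
            try:
                return await call_gemini(segment_id)
            except RateLimited:
                # Backoff with jitter: ~1s, 2s, 4s, ...
                await asyncio.sleep((2 ** attempt) + random.random())
        return {"segment_id": segment_id, "ok": False}

async def dispatch_all(segment_ids, concurrency=450):
    """Cap in-flight requests at `concurrency` (~400-500 for 20K RPM)."""
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(dispatch_one(sem, s) for s in segment_ids))
```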

For Vertex Batch:
- Use Vertex AI Python SDK's `BatchPredictionJob`.
- Prepare input as JSONL files in GCS.
- Each file: up to 1M requests.
- Submit and poll for completion.

---

## 8. Architecture Overview {#8-architecture}

```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                       │
│  (Python async — coordinates all stages)             │
├─────────────┬─────────────┬─────────────────────────┤
│  R2 Client  │  Supabase   │  Audio Preprocessor      │
│  (download  │  (metadata  │  (librosa + silero-vad)  │
│   tars)     │   query)    │                          │
├─────────────┴─────────────┴─────────────────────────┤
│                 DISPATCH LAYER                        │
│  ┌───────────┐ ┌──────────────┐ ┌────────────────┐  │
│  │ AI Studio │ │ Vertex Batch │ │  OpenRouter    │  │
│  │ (async    │ │ (JSONL →     │ │  (fallback)    │  │
│  │  HTTP)    │ │  GCS → poll) │ │                │  │
│  └───────────┘ └──────────────┘ └────────────────┘  │
├─────────────────────────────────────────────────────┤
│              RESULT HANDLER                           │
│  - Parse JSON response                               │
│  - Validate schema                                   │
│  - Compute instant metrics                           │
│  - Write to Supabase (scores + transcripts)          │
│  - Write to R2 (transcript files)                    │
└─────────────────────────────────────────────────────┘
```

**Key components**:

1. **Tar Extractor**: Downloads videoID.tar from R2 → extracts segments/*.flac + metadata.json → queues for preprocessing.

2. **Audio Preprocessor**: 
   - Reads FLAC, computes energy envelope
   - Trims to clean boundaries (§4.2)
   - Splits long segments (§4.3)
   - Adds 150ms silence padding
   - Outputs preprocessed FLAC segments ready for Gemini

3. **Dispatcher**: 
   - Maintains a work queue of (segment_audio, language_tag, video_id, segment_id)
   - Routes to AI Studio (real-time) or prepares JSONL for Vertex Batch
   - Handles rate limiting, retries, circuit breaking

4. **Result Handler**:
   - Parses JSON, validates schema
   - Computes instant metrics (script consistency, duration ratio, language match)
   - Writes to Supabase and R2

---

## 9. Open Questions / Risks {#9-risks}

### Must Resolve Before Implementation

1. **prompt.txt not uploaded** — Need to review your actual prompt and merge with the template in §5.3. Please re-upload.

2. **Vertex Batch API access** — Confirm you have Vertex Batch enabled and a GCS bucket for staging. Batch API for Gemini Flash may have different quota than the 10B you mentioned (that might be for older models). Verify.

3. **Audio preprocessing compute** — Processing 80M segments through librosa + energy analysis needs significant compute. Estimate: ~0.05s per segment × 80M segments ≈ 46 days on a single core. You need to parallelize across 50-100+ cores — a 96-core VM or several machines. **This is a potential bottleneck**.

4. **R2 download bandwidth** — 80M segments × ~100KB avg = ~8TB of audio. At 1Gbps, that's ~18 hours just for download. Factor this into Day 0/1 timeline.

5. **Supabase write throughput** — 80M row inserts. Supabase free tier may bottleneck at ~1K writes/sec. Pro tier can handle ~5-10K/sec. At 5K/sec: 80M / 5K = ~4.4 hours. Should be fine but monitor.

### Risks

| Risk | Impact | Mitigation |
|------|--------|------------|
| AI Studio rate limiting below advertised 20K RPM | Timeline slip | Vertex Batch absorbs overflow |
| Vertex Batch turnaround > 24hrs | Timeline slip | Submit batches early (Day 1); use AI Studio as primary |
| Audio preprocessing bottleneck | Can't feed Gemini fast enough | Pre-process in parallel on Day 0; stream results to dispatch |
| Gemini hallucination on noisy segments | Bad transcripts | Aggressive pre-filtering + validation scores |
| JSON schema violations | Lost results | Retry with same segment; Gemini structured output mode prevents most |
| Cost overrun on OpenRouter | Budget | Hard cap; use only as last resort |

---

## Summary of Key Decisions

| Decision | Choice | Reasoning |
|----------|--------|-----------|
| Temperature | 0 | Deterministic transcription |
| Script strategy | Code-mixed (native + Latin) | Maximum accuracy; convert later |
| Language hint | Soft hint + detected_language | Handles mismatches gracefully |
| Punctuation | Always on | Can strip later; can't add later |
| Audio events | 10 fixed tags, confident-only | Controlled vocabulary for TTS/ASR |
| Accent field | Optional (empty string default) | Too unreliable to require on short clips |
| Silence padding | 150ms both ends | Natural sentence boundary feel |
| Segment length | 2s min, 10s preferred, 15s max | Optimal Gemini accuracy window |
| Primary dispatch | AI Studio + Vertex Batch parallel | Neither alone is safe for 100hr target |
| Validation | Multi-signal composite score | No single method covers all 12 languages |
| Validation timing | Instant metrics during transcription; deep validation post-hoc | Don't block throughput |