---
name: Gemini Transcription Pipeline
overview: Build a large-scale audio transcription pipeline using Gemini 3 Flash to process ~80M audio segments across 12 Indian languages + English, with audio preprocessing, structured JSON output, multi-tier validation, and a 3-backend strategy (AI Studio Realtime primary, Batch secondary, OpenRouter overflow) to complete within 4 days.
todos:
  - id: setup-env
    content: "Project setup: venv, requirements.txt (google-genai, boto3, soundfile, librosa, numpy, asyncio, httpx, supabase-py, pydantic), .gitignore, README.md"
    status: pending
  - id: config-module
    content: "Config module: load .env, define 12 language mappings (lang -> ISO code, script name, Unicode ranges), detected_language enum"
    status: pending
  - id: r2-client
    content: "R2 client: lease videoID.tar from Supabase -> download from R2 -> extract metadata.json + segments/*.flac -> release lock. Tar is the processing unit."
    status: pending
  - id: audio-preprocess
    content: "Audio preprocessing: RMS-based boundary trimming, 150ms silence padding, length filtering (2-15s), re-splitting segments over 10s at silence valleys, store trim metadata"
    status: pending
  - id: prompt-builder
    content: "Prompt builder: lean parameterized system prompt (no schema in text), JSON schema via Pydantic with ISO lang enum + energetic rename, number-as-digits rule, NO_SPEECH handling"
    status: pending
  - id: gemini-realtime
    content: "Gemini Realtime backend (PRIMARY): async client with asyncio + semaphore (500-1000 concurrent), inline audio bytes, exponential backoff on 429s, per-language queues"
    status: pending
  - id: gemini-batch
    content: "Gemini Batch backend (SECONDARY): create JSONL files with inline base64 audio, upload via File API, submit batch jobs for overflow/backlog, poll for results"
    status: pending
  - id: openrouter-client
    content: "OpenRouter fallback (TERTIARY): OpenAI-compatible API, structured outputs via response_format json_schema, daily spend cap, only when Google lanes saturated"
    status: pending
  - id: validator
    content: "Tier 1 validator: empty/NO_SPEECH check, length ratio, script detection, language mismatch, tag consistency, quality_score, ASR/TTS lane eligibility flags, overlap detection"
    status: pending
  - id: db-module
    content: "Supabase DB module: comprehensive per-segment metadata (trim offsets, version tracking, provider, lane flags), transcription_jobs, transcription_flags tables"
    status: pending
  - id: pipeline-orchestrator
    content: "Pipeline orchestrator: tar-based processing unit, coordinate R2 lease -> preprocess -> realtime inference -> validation -> storage, with per-language progress tracking"
    status: pending
  - id: canary-test
    content: "Canary test: 1000 segments stratified across all 12 languages, multiple duration bins, test thinking_level low vs minimal, measure token usage, validate quality"
    status: pending
isProject: false
---

# Gemini Audio Transcription Pipeline - Comprehensive Plan

This plan addresses EVERY concern raised in [instructions.md](instructions.md) and gives the reasoning behind each decision.

---

## Concern 1: Code-Mixed vs Transliterated vs Single-Script Transcription

**Decision: Code-mixed transcription (each language in its own native script)**

Your current [prompt.txt](prompt.txt) already does this correctly ("Write Telugu words in Telugu script. Keep English words in English (Latin script)"). This is the right call because:

- Gemini's audio perception natively distinguishes languages; forcing transliteration adds an error-prone conversion step
- Code-mixed preserves maximum information (you know exactly which words are in which language)
- For TTS training, you need native script to map to correct phonemes. For ASR training, code-mixed is the ground truth of how people actually speak
- You can always convert to romanized/transliterated later with a cheap text-only LLM call, but you cannot recover script information from a romanized-only transcript
- Punctuation stays ON (comma, period, ?, !) since you can always strip it for training but cannot reliably add it later

---

## Concern 2: Language Mismatch and detected_language Handling

**Decision: Pass expected language as a soft hint; trust the model's perception**

Strategy:

- System prompt says: `Expected language: {language}. This is a hint, not a constraint. Trust what you hear.`
- `detected_language` acts as a cross-validation signal: if it differs from the expected language, flag the segment for review
- For code-mixed audio, `detected_language` reports the DOMINANT language (the one spoken for the majority of the segment)
- Post-processing: any segment where `detected_language != expected_language` gets a `language_mismatch` flag in Supabase for later batch review

This avoids both failure modes: (a) Gemini confusing languages when no hint is given, and (b) Gemini forcing everything into the wrong language when the hint is wrong.

**`detected_language` as ISO enum**: Restrict to standardized codes in the JSON schema to prevent messy free-form strings ("Telugu" vs "telugu" vs "te"):

`["hi", "mr", "te", "ta", "kn", "ml", "gu", "pa", "bn", "as", "or", "en", "no_speech", "other"]`

This makes downstream filtering and aggregation clean.
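A minimal sketch of the enum using the stdlib `enum` module (the plan's Pydantic schema would wrap this same value set; the class and constant names here are illustrative):

```python
from enum import Enum

class DetectedLanguage(str, Enum):
    """ISO 639-1 codes for the 12 Indian languages + English,
    plus the two pipeline-specific sentinel values."""
    HI = "hi"
    MR = "mr"
    TE = "te"
    TA = "ta"
    KN = "kn"
    ML = "ml"
    GU = "gu"
    PA = "pa"
    BN = "bn"
    AS = "as"
    OR = "or"
    EN = "en"
    NO_SPEECH = "no_speech"
    OTHER = "other"

# Fragment to embed in the response schema passed to the API
DETECTED_LANGUAGE_SCHEMA = {
    "type": "string",
    "enum": [m.value for m in DetectedLanguage],
}
```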

---

## Concern 3: Temperature Setting

**Decision: Use temperature=0 for deterministic transcription.**

Transcription is a deterministic mapping task: there is one correct transcription for any audio, so temperature=0 minimizes hallucination and ensures reproducibility.

**Risk note**: Gemini 3 docs recommend temp=1.0 and warn that lower temperatures "may lead to unexpected behavior, such as looping or degraded performance." This warning applies mainly to reasoning/math tasks. For transcription with structured output, temp=0 is standard practice. **The canary test (1000 segments) MUST monitor for output looping** - if any looping is detected, we revert to temp=1.0 immediately.

Additional settings:

- `top_p = 1.0` (no nucleus sampling interference)
- `top_k = 1` (greedy decoding)
- `candidate_count = 1`
- `response_mime_type = "application/json"`
- `response_json_schema` = Pydantic-generated schema
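As a sketch with the `google-genai` SDK (a config fragment, not a full client: `TranscriptionResult` is a hypothetical Pydantic model defined elsewhere, and the `thinking_level` field is taken from the Gemini 3 docs and should be verified against the installed SDK version):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

gen_config = types.GenerateContentConfig(
    temperature=0.0,        # deterministic transcription (canary watches for looping)
    top_p=1.0,              # no nucleus sampling interference
    top_k=1,                # greedy decoding
    candidate_count=1,
    response_mime_type="application/json",
    response_schema=TranscriptionResult,  # hypothetical Pydantic response model
    thinking_config=types.ThinkingConfig(thinking_level="low"),
)
```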

---

## Concern 4: Thinking Level

**Decision: `thinking_level: "low"` for Gemini 3 Flash (benchmark `minimal` in canary)**

From Gemini 3 docs: `low` = "Minimizes latency and cost. Best for simple instruction following, chat, or high-throughput applications." This is exactly transcription.

Reasoning for `low` over `minimal`:

- Transcription is perception + encoding, but audio perception of noisy/accented/multilingual speech still benefits from some internal processing
- `minimal` = essentially no thinking, which could hurt quality on complex audio
- `low` is still very cheap on output tokens while maintaining multimodal fidelity
- The canary test will benchmark both `low` and `minimal` side by side to quantify the quality difference
- For retry segments where initial quality was poor, bump to `high`

---

## Concern 5: Audio Event Tags

**Decision: 10 tags, confidence-gated insertion**

Final tag list:

- `[laugh]` - most common in podcasts
- `[cough]` - frequent in natural speech
- `[sigh]` - audible exhalation
- `[breath]` - heavy/audible breathing (NOT normal inter-word breathing)
- `[throat_clear]` - very common in podcasts, distinct sound
- `[singing]` - musical vocal segments (distinct from background [music])
- `[noise]` - catch-all for non-speech sounds (mic bumps, clicks, taps)
- `[music]` - background/intro music
- `[applause]` - audience reactions (live recordings)
- `[sniff]` - common in casual speech, distinctly audible

Why these 10:

- YouTube podcasts are informal; laughter, breathing, throat clears, coughs dominate non-speech events
- `[throat_clear]` replaces `[click]` from the previous version - throat clearing is far more common in podcasts and more useful for TTS controllability
- `[noise]` is the catch-all for anything not in the other 9 categories (including mic clicks, tongue clicks)
- `[singing]` is kept (distinct from [music] for a cappella/humming moments)
- The prompt explicitly says: "ONLY if clearly and prominently audible" - this prevents hallucinated tags
- `tagged` must be CHARACTER-IDENTICAL to `transcription` except for inserted event tags. Do not re-listen or re-interpret.

---

## Concern 6: Speaker Metadata and Accent

**Decision: accent stays OPTIONAL (empty string when not confident). Rename `speaking_style: "excited"` to `"energetic"`.**

- `emotion`, `speaking_style`, `pace` = REQUIRED (reliably detectable from audio prosody)
- `accent` = optional, empty string if uncertain. Indian regional accent detection from audio alone is genuinely hard even for native speakers, and forcing it will produce hallucinated labels that poison training data
- The JSON schema already reflects this: `accent` is not in the `required` array
- For accent, 30% coverage with high confidence is better than 100% coverage with 50% noise

**Schema change**: Rename `speaking_style: "excited"` to `"energetic"` to avoid ambiguity with `emotion: "excited"`. "Energetic" describes HOW someone speaks (delivery style), while "excited" describes WHAT they feel (emotional state).

Updated `speaking_style` enum: `["conversational", "narrative", "energetic", "calm", "emphatic", "sarcastic", "formal"]`

---

## Concern 7: Audio Segment Preprocessing (Boundary Handling)

**Decision: Energy-based trimming with silence padding**

This is the most critical preprocessing step. The algorithm:

```
1. Load FLAC segment, compute RMS energy in 10ms frames

2. START-OF-SEGMENT CHECK:
   - If first 50ms has RMS below silence_threshold (-40 dBFS):
     -> Clean start. Keep as-is.
   - If first 50ms has RMS ABOVE threshold:
     -> Scan forward for first "silence valley" (>= 50ms consecutive below threshold)
     -> If found within first 40% of segment: trim to that point
     -> If NOT found: mark segment as "abrupt_start" (still transcribe, just flag it)

3. END-OF-SEGMENT CHECK:
   - Mirror of step 2, scanning backwards
   - Same 40% limit from the end

4. SILENCE PADDING:
   - Prepend 150ms of silence (zero samples)
   - Append 150ms of silence
   
5. Why 150ms (not 100ms):
   - Standard TTS silence padding is 100-200ms
   - 150ms gives Gemini a clear "sentence boundary" signal
   - Matches silence duration between natural utterances
   - Any value in 100-200ms range is acceptable
```

Even if this trims 30-40% of segments, quality >> quantity for ASR/TTS training data. A perfectly cut 6s segment is worth more than a messy 10s segment.

Implementation: Use `librosa` for RMS computation and `soundfile` for FLAC I/O. Both are fast and handle FLAC natively.
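A pure-Python sketch of steps 1-2 (production code would vectorize this with `librosa`/`numpy`; thresholds mirror the algorithm above, and all names are illustrative):

```python
import math

def frame_rms_db(samples, sr, frame_ms=10):
    """Per-frame RMS energy in dBFS (full scale = 1.0)."""
    n = max(1, int(sr * frame_ms / 1000))
    out = []
    for i in range(0, len(samples), n):
        frame = samples[i:i + n]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        out.append(20 * math.log10(rms) if rms > 0 else -120.0)
    return out

def trim_start(samples, sr, threshold_db=-40.0, frame_ms=10,
               valley_frames=5, max_scan_frac=0.4):
    """Steps 1-2: clean-start check, else scan forward for the first
    silence valley (>= 50ms below threshold) within the first 40%."""
    db = frame_rms_db(samples, sr, frame_ms)
    n = max(1, int(sr * frame_ms / 1000))
    if all(d < threshold_db for d in db[:valley_frames]):
        return samples, "clean_start"
    run = 0
    for i, d in enumerate(db[:int(len(db) * max_scan_frac)]):
        run = run + 1 if d < threshold_db else 0
        if run >= valley_frames:
            return samples[(i + 1) * n:], "trimmed_start"  # cut at valley end
    return samples, "abrupt_start"  # still transcribe, just flag it
```

The end-of-segment check is the mirror image run on the reversed samples, and the 150ms pad of step 4 is just `[0.0] * int(0.150 * sr)` prepended and appended.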

---

## Concern 8: Segment Length Constraints

**Decision: Min 2s, target 2-10s, hard max 15s**

Re-cutting algorithm for segments > 10s:

```
1. Compute energy profile of the full segment
2. Find all silence valleys (>= 100ms below threshold)
3. Prefer the first valley AFTER 7s mark
4. If no valley between 7-12s: pick the LOWEST energy frame in 10-15s range
5. Split at that point, treat each piece as a new segment
6. Apply min-length filter: discard any resulting piece < 2s
7. Hard cap at 15s: if no split point found by 15s, force-trim at 15s
```

For segments < 2s: DISCARD. These are almost always background noise, acknowledgments ("hmm", "haa"), or cross-talk bleed from diarization errors - exactly the kind of noise you mentioned in your data quality notes.
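A sketch of the split-point search over a per-frame dBFS energy profile (10ms frames; the valley-midpoint choice and names are illustrative):

```python
def choose_split_frame(db, frame_ms=10, threshold_db=-40.0,
                       valley_frames=10, prefer_after_s=7.0,
                       window_end_s=12.0, hard_cap_s=15.0):
    """Steps 2-4: prefer the first >=100ms silence valley after the 7s
    mark, else the lowest-energy frame in the 10-15s range."""
    fps = 1000 // frame_ms                    # frames per second
    run = 0
    for i in range(int(prefer_after_s * fps),
                   min(len(db), int(window_end_s * fps))):
        run = run + 1 if db[i] < threshold_db else 0
        if run >= valley_frames:
            return i - valley_frames // 2     # middle of the valley
    lo, hi = int(10.0 * fps), min(len(db), int(hard_cap_s * fps))
    if lo < hi:
        return lo + min(range(hi - lo), key=lambda k: db[lo + k])
    return int(hard_cap_s * fps)              # step 7: force-trim at 15s
```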

---

## Concern 9: Transcript Validation Strategy

**Decision: 3-tier validation, implement Tier 1 now, design Tier 2+3 for later**

### Tier 1: Programmatic Checks (run on every segment, instant, free)

- **Empty/NO_SPEECH check**: transcription is empty, `[NO_SPEECH]`, or `[INAUDIBLE]`
- **Length ratio**: chars_per_second = len(transcription) / audio_duration. Flag if < 2 or > 30
- **Script check**: use Python `unicodedata` to verify characters match expected script
- **Language mismatch**: `detected_language != expected_language` -> flag
- **JSON validity**: already enforced by `response_json_schema`, but double-check parse
- **Tag consistency**: strip all `[tag]` tokens from `tagged`, result must equal `transcription` exactly
- **UNK/INAUDIBLE density**: count `[UNK]` and `[INAUDIBLE]` tokens, flag if > 20% of transcription
- **Overlap detection**: flag `overlap_suspected` based on boundary trimmer metadata
- Store composite `quality_score` (0-1) per segment

**ASR/TTS Lane Eligibility Flags** (computed from Tier 1 metrics):

- `asr_eligible`: true if non-empty transcript, quality_score > 0.3, reasonable length ratio
- `tts_clean_eligible`: true if clean boundaries (no abrupt start/end), single speaker, no overlap, quality_score > 0.7
- `tts_expressive_eligible`: true if tts_clean_eligible AND has approved event tags (for controllable TTS training)
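The core programmatic checks can be sketched as below (the quality_score weighting is illustrative; the real composite would be tuned on the canary):

```python
import re

TAGS = ("laugh", "cough", "sigh", "breath", "throat_clear",
        "singing", "noise", "music", "applause", "sniff")
TAG_RE = re.compile(r"\[(%s)\] ?" % "|".join(TAGS))

def tier1_checks(transcription, tagged, duration_s,
                 expected_lang, detected_lang):
    flags = []
    if transcription.strip() in ("", "[NO_SPEECH]", "[INAUDIBLE]"):
        flags.append("no_speech")
    cps = len(transcription) / max(duration_s, 0.1)
    if not 2 <= cps <= 30:
        flags.append("length_ratio")
    if detected_lang != expected_lang:
        flags.append("language_mismatch")
    if TAG_RE.sub("", tagged).strip() != transcription.strip():
        flags.append("tag_inconsistency")   # tagged must be verbatim + tags
    quality = max(0.0, 1.0 - 0.25 * len(flags))  # illustrative weighting
    return flags, quality
```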

### Tier 2: Cross-Reference Scoring (batch post-processing, later)

- **Primary cross-reference ASR**: `ai4bharat/indic-conformer-600m-multilingual` - covers all 22 Indian languages, best fit for this task
  - Compute CER between Gemini output and IndicConformer output (romanized via `aksharamukha`)
  - Low CER = both models agree = high confidence
- **Language ID verification**: Bhashini audio language detection models - supports exactly our 12 languages
- **Romanization**: Use `aksharamukha` for script-agnostic text comparison. Supports all 12 languages.
- **NeMo Forced Aligner**: For alignment scoring on the subset of languages it supports
- **MFA**: Limited Indic support (Hindi, Bengali only). Not viable as primary validator.
- **Note on speed**: Full validation of 80M segments requires GPU. Plan to sample-validate (e.g., 1% = 800K segments) or run on the flagged tail only.

### Tier 3: Cross-Model Validation (selective, expensive)

- Only for segments flagged by Tier 1 (low quality_score) or Tier 2 (high CER)
- Re-transcribe with Gemini 3 Pro
- If two models agree -> accept; disagree -> manual review queue
- Also useful: Google Cloud Speech-to-Text for word-level confidence on selected subsets

**Pragmatic approach**: Ship with Tier 1 only. Store all metrics. Tier 2 runs as an offline batch after the main 80M pipeline completes. Tier 3 is on-demand for the worst segments.

---

## Concern 10: Rate Limits and Throughput Planning

### The math:

**Target**: 80M segments / 100 hours = 222 segments/second sustained

**Per-segment token estimate** (verify in canary):

- Audio: ~8s avg at 32 tokens/s = ~256 audio tokens
- System prompt: ~400 text tokens (lean prompt, schema removed from text)
- User prompt: ~30 text tokens
- Output: ~200 text tokens (JSON response)
- Total per request: ~886 tokens

### AI Studio Realtime (PRIMARY) - 20K RPM, 20M TPM:

- Token-limited: 20M / 886 = ~22.5K RPM -> capped at 20K RPM = 333 RPS
- Over 100h: 333 * 3600 * 100 = **~120M segments** (capacity exceeds need)
- Practical throughput at 60-70% utilization: ~80-90M segments
- With implicit caching (system prompt cached), effective new tokens drop to ~486/request -> TPM headroom nearly doubles
- **Concurrency**: 500-1000 concurrent async connections with semaphore + exponential backoff on 429s
- **Per-language queues**: allows prioritizing/pausing individual languages
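A sketch of the concurrency pattern (a real worker would catch the SDK's 429 error rather than the placeholder `RateLimitError`, and `transcribe` would call the Gemini API):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Placeholder for the SDK's 429 rate-limit error."""

async def with_backoff(fn, *args, max_retries=6, base=1.0):
    # Exponential backoff with jitter on rate-limit errors
    for attempt in range(max_retries):
        try:
            return await fn(*args)
        except RateLimitError:
            await asyncio.sleep(base * (2 ** attempt) * (0.5 + random.random()))
    raise RuntimeError("retries exhausted")

async def run_pool(segments, transcribe, concurrency=500):
    # Semaphore caps in-flight requests; gather preserves input order
    sem = asyncio.Semaphore(concurrency)

    async def worker(seg):
        async with sem:
            return await with_backoff(transcribe, seg)

    return await asyncio.gather(*(worker(s) for s in segments))
```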

### AI Studio/Vertex Batch API (SECONDARY - overflow/backlog):

- 50% cost savings but 24h SLO - too risky as primary for 4-day deadline
- Use for: backlog from Day 1-2, bulk reprocessing of failed segments, cost optimization on non-urgent languages
- Submit via JSONL files (up to 2GB each)
- At ~50KB per segment (FLAC base64): ~40K segments per JSONL file

### OpenRouter (TERTIARY - overflow only):

- Gemini 3 Flash via OpenRouter at standard pricing
- OpenAI-compatible API with `response_format: { type: "json_schema" }`
- Base64-encoded audio in message content
- Daily spend cap to control costs
- Activate only if: Day 3+ and behind schedule on Google lanes

---

## Concern 11: Cost Estimation

**Gemini 3 Flash Standard (Realtime) pricing** (per 1M tokens):

- Audio input: $1.00
- Text input: $0.50
- Output (incl. thinking): $3.00

**Per segment via Realtime** (with caching):

- Audio input: 256 tokens * $1.00/1M = $0.000256
- Text input: 30 tokens * $0.50/1M = $0.000015 (system prompt cached)
- Output: 200 tokens * $3.00/1M = $0.000600
- **Per segment total: ~$0.00087**

**80M segments * $0.00087 = ~$70,000 if all via realtime**

**Gemini 3 Flash Batch pricing** (50% cheaper):

- Per segment: ~$0.00044
- 80M segments = ~$35,000

**Blended estimate** (70% realtime, 30% batch): ~$60,000

**Note**: Run a 1000-segment canary FIRST to measure actual token counts. Gemini tokenizes audio at ~32 tokens/second but this varies by codec/bitrate. The canary will give precise per-segment costs.
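The arithmetic above as a small helper, so the canary's measured token counts can be plugged in directly (prices hard-coded from the table above; the default token counts are the pre-canary estimates):

```python
PRICE_PER_M = {"audio_in": 1.00, "text_in": 0.50, "output": 3.00}  # $ / 1M tokens

def per_segment_cost(audio_tok=256, text_in_tok=30, out_tok=200, batch=False):
    """Dollar cost of one segment; batch lane is 50% cheaper across the board."""
    cost = (audio_tok * PRICE_PER_M["audio_in"]
            + text_in_tok * PRICE_PER_M["text_in"]
            + out_tok * PRICE_PER_M["output"]) / 1e6
    return cost / 2 if batch else cost

def blended_total(n_segments=80_000_000, realtime_share=0.7):
    """Blended realtime/batch spend for the full run."""
    return n_segments * (realtime_share * per_segment_cost()
                         + (1 - realtime_share) * per_segment_cost(batch=True))
```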

---

## Concern 12: API Backend Strategy

**Priority order** (changed from original plan based on cross-model analysis - batch 24h SLO is too risky for 4-day deadline):

```
1. AI Studio Realtime API (PRIMARY - immediate feedback, adjustable)
   - Async with asyncio + semaphore (500-1000 concurrent)
   - Inline audio bytes (segments are <20MB, no File API needed)
   - Per-language priority queues
   - Immediate validation + storage after each response
   - Target: ~60-80M segments over 4 days
   
2. AI Studio Batch API (SECONDARY - cost savings on overflow)
   - Submit JSONL for bulk backlog/overflow
   - 50% cost savings but 24h turnaround SLO
   - Use for: non-urgent languages, reprocessing failed segments
   - Target: 10-20M segments as overflow
   
3. OpenRouter (TERTIARY - emergency overflow)
   - OpenAI-compatible API with structured outputs
   - Base64-encoded audio in messages
   - Daily spend cap
   - Activate only if: Day 3+ and behind schedule
```

**Processing unit**: `videoID.tar` (not individual segments)

- Lease tar via Supabase lock -> download once -> extract -> preprocess all segments -> submit all -> release lock
- This minimizes R2 round-trips and keeps segments grouped by video/language
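The lease and download steps depend on the Supabase and R2 clients, but the extract step is plain `tarfile` (function name illustrative):

```python
import io
import json
import tarfile

def extract_tar(tar_bytes):
    """Unpack a leased videoID.tar into (metadata dict, {name: flac bytes})."""
    meta, segments = None, {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tf:
        for member in tf.getmembers():
            if not member.isfile():
                continue
            data = tf.extractfile(member).read()
            if member.name.endswith("metadata.json"):
                meta = json.loads(data)
            elif member.name.endswith(".flac"):
                segments[member.name] = data
    return meta, segments
```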

---

## Concern 13: Prompt Caching

**Strategy: Cache the system prompt per language**

- System prompt is ~400 tokens (same for all segments of the same language)
- Gemini 3 Flash implicit caching: min 1024 tokens (system prompt alone is below this)
- For batch: include system_instruction in each request config. Gemini Batch API supports context caching internally
- For realtime: structure requests so system_instruction is consistent, triggering implicit cache hits
- The savings compound: at 80M requests, even 50% cache hit rate on prompt tokens saves millions of tokens

---

## Concern 14: Prompt Refinement

The current [prompt.txt](prompt.txt) needs significant changes. Key principle from Google's structured output docs: **schema belongs in the API parameter, not duplicated in the prompt text**. Duplicating wastes ~300 tokens/request and can lower quality.

Changes:

1. **Remove JSON schema block from prompt text**: Move entirely to `response_json_schema` API parameter. Keep only the FIELD DERIVATION instructions that describe WHAT goes in each field.
2. **Parameterize language**: Replace hardcoded "Telugu (te-IN)" with `EXPECTED LANGUAGE HINT: {language} ({lang_code})`. Make it clear this is a hint, not a constraint.
3. **Add number handling rule**: "Write numbers as digits (22, 500, 2024). Write currency/units as spoken. The goal is numeric representation, not spelled-out words."
4. **Add NO_SPEECH handling**: When no speech detected, output `transcription="[NO_SPEECH]"`, `tagged="[NO_SPEECH]"`, `detected_language="no_speech"`, default speaker metadata (neutral/conversational/normal/"").
5. **Strengthen tagged field**: "tagged must be CHARACTER-IDENTICAL to transcription except for inserted event tags. Do not re-listen or re-interpret. Copy transcription verbatim, then insert tags."
6. **Add boundary handling**: "Audio may start/end mid-speech. Transcribe only what you can confidently hear. If first/last word is cut off, omit it."
7. **Add breath tag clarification**: "Do NOT tag normal inter-word breathing. Only tag [breath] for audible, notable breaths or gasps."
8. **Fix typo**: "AUTHORITATIV E" -> "AUTHORITATIVE"
9. **Prompt compression**: Keep the prompt lean (~400 tokens). Every saved token * 80M requests = significant cost savings.
10. **Script mapping**: Include per-language script names in the template (Telugu -> Telugu script, Hindi -> Devanagari, etc.)
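A sketch of the parameterized builder (wording heavily abbreviated; the real ~400-token prompt carries the full field-derivation, tag, and boundary instructions):

```python
# Per-language script mapping (subset shown; full table lives in config.py)
LANGS = {
    "hi": ("Hindi", "Devanagari"),
    "te": ("Telugu", "Telugu script"),
    "ta": ("Tamil", "Tamil script"),
}

SYSTEM_TEMPLATE = (
    "You are an expert audio transcriber.\n"
    "EXPECTED LANGUAGE HINT: {name} ({code}). "
    "This is a hint, not a constraint. Trust what you hear.\n"
    "Write {name} words in {script}. Keep English words in Latin script.\n"
    "Write numbers as digits (22, 500, 2024).\n"
    "Audio may start/end mid-speech; transcribe only what you confidently hear.\n"
    "If there is no speech, set transcription and tagged to [NO_SPEECH]."
)

def build_system_prompt(code: str) -> str:
    name, script = LANGS[code]
    return SYSTEM_TEMPLATE.format(name=name, code=code, script=script)
```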

---

## Concern 15: Data Flow Architecture

```mermaid
flowchart TD
    subgraph data_source [Data Source]
        R2[R2: videoID.tar]
        SB[Supabase: video metadata]
    end

    subgraph preprocessing [Preprocessing Worker]
        DL[Download + Extract tar]
        META[Read metadata.json]
        TRIM[Trim boundaries + pad silence]
        FILTER[Filter: 2s to 15s]
        SPLIT[Re-split segments over 10s]
        ENCODE[Encode to base64]
    end

    subgraph batch_prep [Batch Preparation]
        PROMPT[Build prompt per language]
        JSONL[Create JSONL files with 40K segments each]
        UPLOAD[Upload JSONL to File API]
    end

    subgraph inference [Inference Backends]
        BATCH[AI Studio Batch API]
        REALTIME[AI Studio Realtime API]
        OR[OpenRouter API]
    end

    subgraph postprocessing [Post-Processing]
        PARSE[Parse JSON responses]
        VALIDATE[Tier 1 validation + scoring]
        STORE[Store in Supabase]
        FLAG[Flag low-confidence segments]
    end

    R2 --> DL
    SB --> META
    DL --> META
    META --> TRIM
    TRIM --> FILTER
    FILTER --> SPLIT
    SPLIT --> ENCODE
    ENCODE --> PROMPT
    PROMPT --> JSONL
    JSONL --> UPLOAD
    UPLOAD --> BATCH
    ENCODE --> REALTIME
    ENCODE --> OR
    BATCH --> PARSE
    REALTIME --> PARSE
    OR --> PARSE
    PARSE --> VALIDATE
    VALIDATE --> STORE
    VALIDATE --> FLAG
    FLAG --> REALTIME
```

---

## Concern 16: Supabase Schema

Comprehensive per-segment metadata is critical for reproducibility. Version everything.

**`transcription_results`** (main table):

- `id` (uuid, PK)
- `video_id`, `segment_file`, `speaker_id`
- `original_start_ms`, `original_end_ms` (from metadata.json)
- `trimmed_start_ms`, `trimmed_end_ms` (after boundary trimming)
- `leading_pad_ms`, `trailing_pad_ms`
- `expected_language_hint` (from Supabase video metadata)
- `detected_language` (ISO code from model response)
- `lang_mismatch_flag` (bool)
- `transcription`, `tagged`
- `speaker_emotion`, `speaker_style`, `speaker_pace`, `speaker_accent`
- `num_unk`, `num_inaudible`, `num_event_tags`
- `boundary_score`, `text_length_per_sec`
- `overlap_suspected` (bool)
- `quality_score` (0-1 composite)
- `asr_eligible`, `tts_clean_eligible`, `tts_expressive_eligible` (bool lane flags)
- `prompt_version`, `schema_version`, `trimmer_version`, `validator_version`
- `model_id`, `temperature`, `thinking_level`
- `provider` (aistudio_realtime / aistudio_batch / openrouter)
- `token_usage_json` (input/output/cached token counts)
- `created_at`

**`transcription_jobs`**: job_id, batch_name, status, segment_count, completed_count, failed_count, provider, created_at, completed_at

**`transcription_flags`**: segment_id, flag_type, details, resolved (bool), resolved_at

---

## Implementation Modules

The pipeline will be built as a Python package with these modules:

- `config.py` - env vars, constants, language mappings
- `r2_client.py` - R2 download/extract
- `audio_preprocess.py` - trim, pad, length filter, re-split
- `prompt_builder.py` - parameterized prompt per language
- `gemini_batch.py` - JSONL creation, batch submission, polling
- `gemini_realtime.py` - async realtime API calls
- `openrouter_client.py` - OpenRouter fallback
- `validator.py` - Tier 1 validation + scoring
- `db.py` - Supabase read/write
- `pipeline.py` - orchestrator tying everything together
- `main.py` - CLI entry point

