---
name: Gemini Transcription Pipeline
overview: Build a large-scale audio transcription pipeline using Gemini 3 Flash to process ~80M audio segments across 12 Indian languages + English, with audio preprocessing, structured JSON output, multi-tier validation, and a 3-backend strategy (AI Studio Realtime primary, Batch secondary, OpenRouter overflow) to complete within 4 days.
todos:
  - id: setup-env
    content: "Project setup: venv, requirements.txt (google-genai, boto3, soundfile, librosa, numpy, asyncio, httpx, supabase-py, pydantic), .gitignore, README.md"
    status: pending
  - id: config-module
    content: "Config module: load .env, define 12 language mappings (lang -> ISO code, script name, Unicode ranges), detected_language enum"
    status: pending
  - id: r2-client
    content: "R2 client: lease videoID.tar from Supabase -> download from R2 -> extract metadata.json + segments/*.flac -> release lock. Tar is the processing unit."
    status: pending
  - id: audio-preprocess
    content: "Audio preprocessing: RMS-based boundary trimming, 150ms silence padding, length filtering (2-15s), re-splitting segments over 10s at silence valleys, store trim metadata"
    status: pending
  - id: prompt-builder
    content: "Prompt builder: lean parameterized system prompt (no schema in text), JSON schema via Pydantic with ISO lang enum + energetic rename, number-as-digits rule, NO_SPEECH handling"
    status: pending
  - id: gemini-realtime
    content: "Gemini Realtime backend (PRIMARY): async client with asyncio + semaphore (500-1000 concurrent), inline audio bytes, exponential backoff on 429s, per-language queues"
    status: pending
  - id: gemini-batch
    content: "Gemini Batch backend (SECONDARY): create JSONL files with inline base64 audio, upload via File API, submit batch jobs for overflow/backlog, poll for results"
    status: pending
  - id: openrouter-client
    content: "OpenRouter fallback (TERTIARY): OpenAI-compatible API, structured outputs via response_format json_schema, daily spend cap, only when Google lanes saturated"
    status: pending
  - id: validator
    content: "Tier 1 validator: empty/NO_SPEECH check, length ratio, script detection, language mismatch, tag consistency, quality_score, ASR/TTS lane eligibility flags, overlap detection"
    status: pending
  - id: db-module
    content: "Supabase DB module: comprehensive per-segment metadata (trim offsets, version tracking, provider, lane flags), transcription_jobs, transcription_flags tables"
    status: pending
  - id: pipeline-orchestrator
    content: "Pipeline orchestrator: tar-based processing unit, coordinate R2 lease -> preprocess -> realtime inference -> validation -> storage, with per-language progress tracking"
    status: pending
  - id: canary-test
    content: "Canary test: 1000 segments stratified across all 12 languages, multiple duration bins, test thinking_level low vs minimal, measure token usage, validate quality"
    status: pending
isProject: false
---

# Gemini Audio Transcription Pipeline - Comprehensive Plan

This plan addresses EVERY concern raised in [instructions.md](instructions.md) and is backed with reasoning for each decision.

---

## Concern 1: Code-Mixed vs Transliterated vs Single-Script Transcription

**Decision: Code-mixed transcription (each language in its own native script)**

Your current [prompt.txt](prompt.txt) already does this correctly ("Write Telugu words in Telugu script. Keep English words in English (Latin script)"). This is the right call because:

- Gemini's audio perception natively distinguishes languages; forcing transliteration adds an error-prone conversion step
- Code-mixed preserves maximum information (you know exactly which words are in which language)
- For TTS training, you need native script to map to correct phonemes. For ASR training, code-mixed is the ground truth of how people actually speak
- You can always convert to romanized/transliterated later with a cheap text-only LLM call, but you cannot recover script information from a romanized-only transcript
- Punctuation stays ON (comma, period, ?, !) since you can always strip it for training but cannot reliably add it later

---

## Concern 2: Language Mismatch and detected_language Handling

**Decision: Pass expected language as a soft hint; trust the model's perception**

Strategy:

- System prompt says: `Expected language: {language}. This is a hint, not a constraint. Trust what you hear.`
- `detected_language` acts as a cross-validation signal: if it differs from the expected language, flag the segment for review
- For code-mixed audio, `detected_language` reports the DOMINANT language (the one spoken for the majority of the segment)
- Post-processing: any segment where `detected_language != expected_language` gets a `language_mismatch` flag in Supabase for later batch review

This avoids both failure modes: (a) Gemini confusing languages when no hint is given, and (b) Gemini forcing everything into the wrong language when the hint is wrong.

---

## Concern 3: Temperature Setting

**Decision: Use temperature=0 for deterministic transcription.**

Transcription is an essentially deterministic mapping task - for any given audio there is one correct transcription. temperature=0 minimizes hallucination and ensures reproducibility.

**Risk note**: Gemini 3 docs recommend temp=1.0 and warn that lower temperatures "may lead to unexpected behavior, such as looping or degraded performance." This warning applies mainly to reasoning/math tasks. For transcription with structured output, temp=0 is standard practice. **The canary test (1000 segments) MUST monitor for output looping** - if any looping is detected, we revert to temp=1.0 immediately.

Additional settings:

- `top_p = 1.0` (no nucleus sampling interference)
- `top_k = 1` (greedy decoding)
- `candidate_count = 1`
- `response_mime_type = "application/json"`
- `response_json_schema` = Pydantic-generated schema
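As a sketch, these settings collect into a single config dict (kept SDK-agnostic here; the exact google-genai field names, including the `thinking_config` spelling, are assumptions to verify against the installed SDK version):

```python
# Generation settings for deterministic transcription (sketch; verify the
# exact google-genai config field names against the installed SDK version).
GEN_CONFIG = {
    "temperature": 0,            # deterministic decoding
    "top_p": 1.0,                # no nucleus-sampling interference
    "top_k": 1,                  # greedy decoding
    "candidate_count": 1,
    "response_mime_type": "application/json",
    # response_json_schema is filled from the Pydantic model at runtime
    "thinking_config": {"thinking_level": "minimal"},  # see Concern 4
}
```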

---

## Concern 4: Thinking Level

**Decision: `thinking_level: "minimal"` for Gemini 3 Flash**

Reasoning:

- Transcription is a perception + encoding task, not a reasoning/math task
- `minimal` = "matches no thinking for most queries" - perfect for transcription
- Saves output tokens (thinking tokens are billed at $3.00/M for standard, $1.50/M for batch)
- At 80M segments, even 50 fewer thinking tokens per request = 4B fewer billed output tokens
- The prompt already instructs the model HOW to transcribe; it doesn't need to "reason" about it
- For retry segments where initial quality was poor, bump to `low`

---

## Concern 5: Audio Event Tags

**Decision: 10 tags, confidence-gated insertion**

Final tag list:

- `[laugh]` - most common in podcasts
- `[cough]` - frequent in natural speech
- `[sigh]` - audible exhalation
- `[breath]` - heavy/audible breathing
- `[singing]` - musical vocal segments
- `[noise]` - catch-all for non-speech sounds (mic bumps, clicks, taps)
- `[music]` - background/intro music
- `[applause]` - audience reactions (live recordings)
- `[sniff]` - common in casual speech, distinctly audible
- `[click]` - tongue/mouth clicks, distinct from [noise]

Why these 10:

- YouTube podcasts are informal; laughter, breathing, coughs dominate non-speech events
- `[noise]` is the catch-all for anything not in the other 9 categories
- `[sniff]` and `[click]` are common in natural speech and useful for TTS naturalness
- The prompt explicitly says: "ONLY if clearly and prominently audible" - this prevents hallucinated tags
- The `tagged` field is derived from `transcription` (copy + insert tags), NOT re-interpreted

---

## Concern 6: Speaker Metadata and Accent

**Decision: accent stays OPTIONAL (empty string when not confident)**

- `emotion`, `speaking_style`, `pace` = REQUIRED (these are reliably detectable from audio prosody)
- `accent` = optional, empty string if uncertain. Indian regional accent detection from audio alone is genuinely hard even for native speakers, and forcing it will produce hallucinated labels that poison training data
- The JSON schema already reflects this: `accent` is not in the `required` array
- For accent, better to have 30% coverage with high confidence than 100% coverage with 50% noise
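A minimal JSON-schema fragment showing the shape of this decision - `accent` present in `properties` but deliberately absent from `required` (field names illustrative; the real schema is Pydantic-generated):

```python
# Speaker-metadata schema fragment (illustrative): accent is optional,
# so the model can return "" instead of hallucinating a label.
SPEAKER_SCHEMA = {
    "type": "object",
    "properties": {
        "emotion": {"type": "string"},
        "speaking_style": {"type": "string"},
        "pace": {"type": "string"},
        "accent": {"type": "string"},  # empty string when uncertain
    },
    "required": ["emotion", "speaking_style", "pace"],  # accent omitted
}
```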

---

## Concern 7: Audio Segment Preprocessing (Boundary Handling)

**Decision: Energy-based trimming with silence padding**

This is the most critical preprocessing step. The algorithm:

```
1. Load FLAC segment, compute RMS energy in 10ms frames

2. START-OF-SEGMENT CHECK:
   - If first 50ms has RMS below silence_threshold (-40 dBFS):
     -> Clean start. Keep as-is.
   - If first 50ms has RMS ABOVE threshold:
     -> Scan forward for first "silence valley" (>= 50ms consecutive below threshold)
     -> If found within first 40% of segment: trim to that point
     -> If NOT found: mark segment as "abrupt_start" (still transcribe, just flag it)

3. END-OF-SEGMENT CHECK:
   - Mirror of step 2, scanning backwards
   - Same 40% limit from the end

4. SILENCE PADDING:
   - Prepend 150ms of silence (zero samples)
   - Append 150ms of silence
   
5. Why 150ms (not 100ms):
   - Standard TTS silence padding is 100-200ms
   - 150ms gives Gemini a clear "sentence boundary" signal
   - Matches silence duration between natural utterances
   - Any value in 100-200ms range is acceptable
```

Even if this trims 30-40% of segments, quality >> quantity for ASR/TTS training data. A perfectly cut 6s segment is worth more than a messy 10s segment.

Implementation: Use `librosa` for RMS computation and `soundfile` for FLAC I/O. Both are fast and handle FLAC natively.
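A pure-Python sketch of steps 1, 2, and 4 (start-of-segment only; the end check mirrors it, and the production version would use librosa's vectorized RMS rather than this loop):

```python
import math

SILENCE_DBFS = -40.0           # silence threshold from step 2
FRAME_MS, SR = 10, 16000       # 10 ms RMS frames (sample rate assumed)
FRAME = SR * FRAME_MS // 1000
PAD = int(0.150 * SR)          # 150 ms of silence padding

def frame_rms_dbfs(samples):
    """Per-frame RMS energy in dBFS for float samples in [-1, 1]."""
    out = []
    for i in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[i:i + FRAME]
        rms = math.sqrt(sum(s * s for s in frame) / FRAME)
        out.append(20 * math.log10(max(rms, 1e-10)))
    return out

def trim_and_pad(samples):
    """If the segment starts mid-speech, trim to the first >=50 ms silence
    valley within the first 40% of the segment, then pad both ends with
    150 ms of silence. (End-of-segment trimming mirrors this; omitted.)"""
    levels = frame_rms_dbfs(samples)
    start = 0
    if levels and levels[0] > SILENCE_DBFS:       # noisy start
        limit = int(len(levels) * 0.4)
        run = 0
        for i, db in enumerate(levels[:limit]):
            run = run + 1 if db <= SILENCE_DBFS else 0
            if run * FRAME_MS >= 50:              # >= 50 ms valley found
                start = (i + 1) * FRAME
                break
    return [0.0] * PAD + samples[start:] + [0.0] * PAD
```

If no valley is found within the 40% limit, `start` stays 0 and the caller flags the segment `abrupt_start`, matching step 2.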

---

## Concern 8: Segment Length Constraints

**Decision: Min 2s, target 2-10s, hard max 15s**

Re-cutting algorithm for segments > 10s:

```
1. Compute energy profile of the full segment
2. Find all silence valleys (>= 100ms below threshold)
3. Prefer the first valley AFTER 7s mark
4. If no valley between 7-12s: pick the LOWEST energy frame in 10-15s range
5. Split at that point, treat each piece as a new segment
6. Apply min-length filter: discard any resulting piece < 2s
7. Hard cap at 15s: if no split point found by 15s, force-trim at 15s
```

For segments < 2s: DISCARD. These are almost always background noise, acknowledgments ("hmm", "haa"), or cross-talk bleed from diarization errors - exactly the kind of noise you mentioned in your data quality notes.
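The valley-selection logic above can be sketched as follows, given valley positions already extracted from the energy profile (the lowest-energy-frame fallback in step 4 is simplified here to "no cut found", which the caller resolves with the 15s force-trim):

```python
MIN_S, TARGET_S, MAX_S = 2.0, 10.0, 15.0

def split_points(valleys_s, duration_s):
    """Choose split offsets (seconds) for a segment over 10s, preferring
    the first silence valley in each piece's 7-12s window. Sketch only:
    the lowest-energy-frame fallback is left to the caller."""
    cuts, start = [], 0.0
    while duration_s - start > TARGET_S:
        window = [v for v in valleys_s if start + 7.0 <= v <= start + 12.0]
        cut = window[0] if window else min(start + MAX_S, duration_s)
        if cut >= duration_s:
            break                       # no usable valley; force-trim later
        cuts.append(cut)
        start = cut
    return cuts
```

The resulting pieces then pass through the min-length filter (discard < 2s) exactly as in step 6.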

---

## Concern 9: Transcript Validation Strategy

**Decision: 3-tier validation, implement Tier 1 now, design Tier 2+3 for later**

### Tier 1: Programmatic Checks (run on every segment, instant, free)

- **Empty check**: `transcription` is empty or `[NO_SPEECH]` or `[INAUDIBLE]`
- **Length ratio**: chars_per_second = len(transcription) / audio_duration. Flag if < 2 or > 30 (impossible speech rate)
- **Script check**: use Python `unicodedata` to verify characters match expected script (Telugu chars in Telugu segments, etc.)
- **Language mismatch**: `detected_language != expected_language` -> flag
- **JSON validity**: already enforced by `response_json_schema`, but double-check parse
- **Tag consistency**: `tagged` field should be a superset of `transcription` (same text + tags)
- Store a composite `quality_score` (0-1) per segment
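A sketch of the core Tier 1 checks (flag names and the 80% script-coverage cutoff are assumptions; `script_name` is the Unicode character-name prefix for the expected script, with Latin allowed for code-mixed English):

```python
import unicodedata

def tier1_flags(transcription, duration_s, expected_lang, detected_lang,
                script_name="TELUGU"):
    """Return the list of Tier 1 flags for one segment (sketch)."""
    flags = []
    text = transcription.strip()
    if not text or text in ("[NO_SPEECH]", "[INAUDIBLE]"):
        return ["empty"]
    cps = len(text) / duration_s
    if cps < 2 or cps > 30:
        flags.append("speech_rate")            # impossible chars/second
    letters = [c for c in text if c.isalpha()]
    in_script = sum(script_name in unicodedata.name(c, "")
                    or "LATIN" in unicodedata.name(c, "")  # code-mixed English
                    for c in letters)
    if letters and in_script / len(letters) < 0.8:
        flags.append("script_mismatch")
    if detected_lang != expected_lang:
        flags.append("language_mismatch")
    return flags
```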

### Tier 2: Alignment-Based Scoring (batch post-processing, later)

- **For Indic languages**: Use `ai4bharat/IndicWav2Vec` or `Whisper large-v3` as a cross-reference ASR
  - Run Whisper on the same audio, compare outputs using character error rate (CER) / word error rate (WER)
  - Low WER = high confidence in Gemini's transcript
- **Romanization**: Use `aksharamukha` (Python library) for programmatic script conversion. Supports all 12 of your languages. Convert both Gemini output and reference output to Roman, then compute normalized edit distance
- **MFA (Montreal Forced Aligner)**: Has limited Indic support (Hindi, Bengali). For other Indic languages, use `NeMo` forced aligner or `CTC segmentation` from `wav2letter`
- **Note on MFA speed**: MFA processes ~100x realtime on CPU. For 80M segments of avg 8s = 640M seconds of audio. At 100x realtime, that's 6.4M seconds = 74 days on 1 CPU. Need GPU-accelerated alternatives or only sample-validate.

### Tier 3: Cross-Model Validation (selective, expensive)

- Only for segments flagged by Tier 1 (low quality_score) or Tier 2 (high WER)
- Re-transcribe with Gemini 3 Pro (slower, more expensive, but more accurate)
- If two models agree -> accept; disagree -> manual review queue

**Pragmatic approach**: Ship with Tier 1 only. Store all metrics. Tier 2 runs as an offline batch after the main 80M pipeline completes. Tier 3 is on-demand for the worst segments.

---

## Concern 10: Rate Limits and Throughput Planning

### The math:

**Target**: 80M segments / 100 hours = 222 segments/second sustained

**Per-segment token estimate** (need to verify with test batch):

- Audio: ~8s avg at ~25 tokens/s = ~200 audio tokens
- System prompt: ~400 text tokens (cached after first request)
- User prompt: ~30 text tokens
- Output: ~200 text tokens (JSON response)
- Total per request: ~830 tokens

### AI Studio Realtime (20K RPM, 20M TPM):

- Token budget allows 20M / 830 = ~24K RPM, so the 20K RPM cap binds -> 333 RPS
- Over 100h: 333 * 3600 * 100 = **~120M segments** (capacity exceeds need)
- BUT: practical throughput with network latency + async overhead is ~60-70% of that = ~72-84M segments over 100h, which only just covers the 80M target
- With prompt caching (system instruction cached), effective token count drops to ~430 new tokens/request -> TPM headroom doubles
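The capacity arithmetic above, written out (token estimates from the per-segment breakdown; all figures pre-caching):

```python
# Realtime-lane capacity check against AI Studio limits.
TOKENS_PER_REQ = 200 + 400 + 30 + 200   # audio + system + user + output = 830
RPM_CAP, TPM_CAP = 20_000, 20_000_000

rpm_token_limited = TPM_CAP // TOKENS_PER_REQ       # ~24K, above the RPM cap
rps = min(RPM_CAP, rpm_token_limited) / 60          # ~333 requests/second
capacity_100h = rps * 3600 * 100                    # ~120M segments theoretical
```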

### AI Studio/Vertex Batch API:

- 50% cost savings
- 24h SLO (usually faster)
- Submit via JSONL files (up to 2GB each)
- At ~50KB per segment (FLAC base64 encoded): ~40K segments per JSONL file
- 80M segments / 40K = 2000 JSONL files
- Can submit in waves: 500 batch jobs/day over 4 days
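One JSONL line per request, with the audio inlined as base64. A sketch of the line builder - the request envelope (`key`, `request`, `system_instruction`, `inline_data` field names) is assumed from the Batch API format and should be verified against current docs:

```python
import base64
import json

def batch_line(key, flac_bytes, system_prompt, user_prompt):
    """Serialize one batch request as a JSONL line (envelope assumed)."""
    return json.dumps({
        "key": key,                                 # segment identifier
        "request": {
            "system_instruction": {"parts": [{"text": system_prompt}]},
            "contents": [{"parts": [
                {"inline_data": {
                    "mime_type": "audio/flac",
                    "data": base64.b64encode(flac_bytes).decode("ascii"),
                }},
                {"text": user_prompt},
            ]}],
        },
    }, ensure_ascii=False)
```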

### OpenRouter (fallback only):

- Gemini 3 Flash via OpenRouter at standard pricing
- Use OpenAI-compatible API with `response_format: { type: "json_schema" }`
- Activate only if AI Studio rate limits are saturated or Batch API queues are backed up

---

## Concern 11: Cost Estimation

**Gemini 3 Flash Batch pricing** (per 1M tokens):

- Audio input: $0.50
- Text input: $0.25
- Output: $1.50

**Per segment** (with implicit caching; cached system-prompt tokens billed at ~$0.10/M):

- Audio input: 200 tokens * $0.50/1M = $0.000100
- Text input: 30 tokens * $0.25/1M = $0.0000075 (system prompt cached)
- Output: 200 tokens * $1.50/1M = $0.000300
- **Per segment total: ~$0.0004**

**80M segments * $0.0004 = ~$32,000 for batch**

For realtime (double the batch price): ~$64,000 if all done via realtime.
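The cost arithmetic above, written out (token counts are the pre-canary estimates):

```python
# Gemini 3 Flash Batch pricing, $/1M tokens (from the table above).
BATCH_PRICE = {"audio_in": 0.50, "text_in": 0.25, "out": 1.50}

per_segment = (200 * BATCH_PRICE["audio_in"]      # audio input tokens
               + 30 * BATCH_PRICE["text_in"]      # uncached text input
               + 200 * BATCH_PRICE["out"]) / 1_000_000

total_batch = 80_000_000 * per_segment            # ~$32.6K all-batch
total_realtime = 2 * total_batch                  # ~$65K at double the rate
```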

**Note**: Run a 1000-segment test batch FIRST to measure actual token counts. The audio tokenization rate is critical and varies by codec/bitrate.

---

## Concern 12: API Backend Strategy

**Priority order**:

```
1. AI Studio Realtime API (PRIMARY - immediate results, capacity covers the target)
   - Async with asyncio + semaphore (500-1000 concurrent)
   - Per-language queues, exponential backoff on 429s

2. AI Studio Batch API (SECONDARY - 50% cost, async, bulk)
   - Submit JSONL with inline audio base64 for overflow/backlog
   - Poll for results every 5 minutes
   - Process results as they complete

3. OpenRouter (TERTIARY - overflow only)
   - OpenAI-compatible API
   - Only if: Day 3+ and behind schedule
   - Supports Gemini 3 Flash with structured outputs
```
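The realtime lane's bounded fan-out can be sketched as follows (`call_api` stands in for the per-segment google-genai coroutine; retry count and backoff base are assumptions):

```python
import asyncio

async def transcribe_all(segments, call_api, max_concurrent=500):
    """Fan out per-segment API calls with bounded concurrency and
    exponential backoff on failures (e.g. 429s). Sketch only."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(seg):
        async with sem:
            for attempt in range(5):
                try:
                    return await call_api(seg)
                except Exception:              # rate limit / transient error
                    await asyncio.sleep(2 ** attempt)
            return None                        # give up; re-queue elsewhere

    # gather() preserves input order, so results align with segments
    return await asyncio.gather(*(one(s) for s in segments))
```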

---

## Concern 13: Prompt Caching

**Strategy: Cache the system prompt per language**

- System prompt is ~400 tokens (same for all segments of the same language)
- Gemini 3 Flash implicit caching: min 1024 tokens (system prompt alone is below this)
- For batch: include system_instruction in each request config. Gemini Batch API supports context caching internally
- For realtime: structure requests so system_instruction is consistent, triggering implicit cache hits
- The savings compound: at 80M requests, even 50% cache hit rate on prompt tokens saves millions of tokens

---

## Concern 14: Prompt Refinement

The current [prompt.txt](prompt.txt) is solid but needs these changes:

1. **Parameterize language**: Replace hardcoded "Telugu (te-IN)" with `{language} ({lang_code})`
2. **Remove temperature guidance from prompt**: Temperature is an API parameter, not prompt text
3. **Strengthen tagged field instruction**: "Copy `transcription` verbatim, then insert tags. Do NOT re-listen or re-interpret."
4. **Add boundary handling note**: "Audio may start/end mid-speech. Transcribe only what you can confidently hear. If first/last word is cut off, omit it."
5. **Add script mapping table**: For each language, specify its script name (Telugu -> Telugu script, Hindi -> Devanagari, etc.)
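The parameterization in points 1, 4, and 5 looks roughly like this (prompt wording illustrative, not the final prompt.txt; the script table is an excerpt of the full 12-language mapping):

```python
# Illustrative system-prompt template; the real text lives in prompt.txt.
SYSTEM_PROMPT = """You are transcribing a short audio segment.
Expected language: {language} ({lang_code}). This is a hint, not a constraint.
Write {language} words in {script} script; keep English words in Latin script.
Audio may start or end mid-speech: transcribe only what you can confidently
hear, and omit a cut-off first or last word. Write numbers as digits.
If there is no speech, output [NO_SPEECH]."""

SCRIPTS = {"te": "Telugu", "hi": "Devanagari", "bn": "Bengali"}  # excerpt

def build_prompt(language, lang_code):
    """Fill the per-language template (point 1's {language} ({lang_code}))."""
    return SYSTEM_PROMPT.format(language=language, lang_code=lang_code,
                                script=SCRIPTS[lang_code])
```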

---

## Concern 15: Data Flow Architecture

```mermaid
flowchart TD
    subgraph data_source [Data Source]
        R2[R2: videoID.tar]
        SB[Supabase: video metadata]
    end

    subgraph preprocessing [Preprocessing Worker]
        DL[Download + Extract tar]
        META[Read metadata.json]
        TRIM[Trim boundaries + pad silence]
        FILTER[Filter: 2s to 15s]
        SPLIT[Re-split segments over 10s]
        ENCODE[Encode to base64]
    end

    subgraph batch_prep [Batch Preparation]
        PROMPT[Build prompt per language]
        JSONL[Create JSONL files with 40K segments each]
        UPLOAD[Upload JSONL to File API]
    end

    subgraph inference [Inference Backends]
        BATCH[AI Studio Batch API]
        REALTIME[AI Studio Realtime API]
        OR[OpenRouter API]
    end

    subgraph postprocessing [Post-Processing]
        PARSE[Parse JSON responses]
        VALIDATE[Tier 1 validation + scoring]
        STORE[Store in Supabase]
        FLAG[Flag low-confidence segments]
    end

    R2 --> DL
    SB --> META
    DL --> META
    META --> TRIM
    TRIM --> FILTER
    FILTER --> SPLIT
    SPLIT --> ENCODE
    ENCODE --> PROMPT
    PROMPT --> JSONL
    JSONL --> UPLOAD
    UPLOAD --> BATCH
    ENCODE --> REALTIME
    ENCODE --> OR
    BATCH --> PARSE
    REALTIME --> PARSE
    OR --> PARSE
    PARSE --> VALIDATE
    VALIDATE --> STORE
    VALIDATE --> FLAG
    FLAG --> REALTIME
```



---

## Concern 16: Supabase Schema

New tables needed:

- **`transcription_results`**: video_id, segment_file, transcription, tagged, speaker_json, detected_language, quality_score, language_mismatch (bool), api_backend (batch/realtime/openrouter), token_usage, created_at
- **`transcription_jobs`**: job_id, batch_name, status, segment_count, completed_count, failed_count, created_at, completed_at
- **`transcription_flags`**: segment_id, flag_type (low_score/language_mismatch/empty/etc), resolved (bool)

---

## Implementation Modules

The pipeline will be built as a Python package with these modules:

- `config.py` - env vars, constants, language mappings
- `r2_client.py` - R2 download/extract
- `audio_preprocess.py` - trim, pad, length filter, re-split
- `prompt_builder.py` - parameterized prompt per language
- `gemini_batch.py` - JSONL creation, batch submission, polling
- `gemini_realtime.py` - async realtime API calls
- `openrouter_client.py` - OpenRouter fallback
- `validator.py` - Tier 1 validation + scoring
- `db.py` - Supabase read/write
- `pipeline.py` - orchestrator tying everything together
- `main.py` - CLI entry point

