---
name: Gemini Transcription Pipeline
overview: Build a large-scale audio transcription pipeline using Gemini 3 Flash to process ~80M audio segments across 12 Indian languages + English, with audio preprocessing, structured JSON output, multi-tier validation, and a 3-backend strategy (AI Studio Batch, AI Studio Realtime, OpenRouter) to complete within 4 days.
todos:
  - id: setup-env
    content: "Project setup: venv, requirements.txt (google-genai, boto3, soundfile, librosa, numpy, asyncio, httpx, supabase-py, pydantic), .gitignore, README.md"
    status: pending
  - id: config-module
    content: "Config module: load .env, define 12 language mappings (lang -> script name, lang code, Unicode ranges for validation)"
    status: pending
  - id: r2-client
    content: "R2 client: download videoID.tar from R2, extract metadata.json + segments/*.flac, return segment list with metadata"
    status: pending
  - id: audio-preprocess
    content: "Audio preprocessing: RMS-based boundary trimming, 150ms silence padding, length filtering (2-15s), re-splitting segments over 10s at silence valleys"
    status: pending
  - id: prompt-builder
    content: "Prompt builder: parameterized system prompt template for all 12 languages, JSON schema definition via Pydantic, prompt refinements from plan"
    status: pending
  - id: gemini-batch
    content: "Gemini Batch backend: create JSONL files (40K segments each) with inline base64 audio, upload via File API, submit batch jobs, poll for results, parse JSONL responses"
    status: pending
  - id: gemini-realtime
    content: "Gemini Realtime backend: async client with semaphore (200 concurrent), exponential backoff retry logic; used for failed-batch retries and urgent processing"
    status: pending
  - id: openrouter-client
    content: "OpenRouter fallback backend: OpenAI-compatible API, structured outputs via response_format json_schema, only activated when needed"
    status: pending
  - id: validator
    content: "Tier 1 validator: empty check, length ratio, script detection, language mismatch, tag consistency, composite quality_score (0-1)"
    status: pending
  - id: db-module
    content: "Supabase DB module: create tables (transcription_results, transcription_jobs, transcription_flags), batch insert results, update job status"
    status: pending
  - id: pipeline-orchestrator
    content: "Pipeline orchestrator: coordinate R2 download -> preprocess -> batch submission -> result collection -> validation -> storage, with progress tracking"
    status: pending
  - id: test-run
    content: "Test run: process 100 segments across 3 languages to verify end-to-end flow, measure actual token usage, validate quality, tune parameters before full-scale run"
    status: pending
isProject: false
---

# Gemini Audio Transcription Pipeline - Comprehensive Plan

This plan addresses EVERY concern raised in [instructions.md](instructions.md) and is backed with reasoning for each decision.

---

## Concern 1: Code-Mixed vs Transliterated vs Single-Script Transcription

**Decision: Code-mixed transcription (each language in its own native script)**

Your current [prompt.txt](prompt.txt) already does this correctly ("Write Telugu words in Telugu script. Keep English words in English (Latin script)"). This is the right call because:

- Gemini's audio perception natively distinguishes languages; forcing transliteration adds an error-prone conversion step
- Code-mixed preserves maximum information (you know exactly which words are in which language)
- For TTS training, you need native script to map to correct phonemes. For ASR training, code-mixed is the ground truth of how people actually speak
- You can always convert to romanized/transliterated later with a cheap text-only LLM call, but you cannot recover script information from a romanized-only transcript
- Punctuation stays ON (comma, period, ?, !) since you can always strip it for training but cannot reliably add it later

---

## Concern 2: Language Mismatch and detected_language Handling

**Decision: Pass expected language as a soft hint; trust the model's perception**

Strategy:
- System prompt says: `Expected language: {language}. This is a hint, not a constraint. Trust what you hear.`
- `detected_language` acts as a cross-validation signal: if it differs from the expected language, flag the segment for review
- For code-mixed audio, `detected_language` reports the DOMINANT language (the one spoken for the majority of the segment)
- Post-processing: any segment where `detected_language != expected_language` gets a `language_mismatch` flag in Supabase for later batch review

This avoids both failure modes: (a) Gemini confusing languages when no hint is given, and (b) Gemini forcing everything into the wrong language when the hint is wrong.

---

## Concern 3: Temperature Setting

**Decision: Use temperature=1.0 (Gemini 3 default). Do NOT use 0.**

From the Gemini 3 docs ([gemini3.md](docs/gemini3.md), line 261-263):

> "For Gemini 3, we strongly recommend keeping the temperature parameter at its default value of 1.0. Changing the temperature (setting it below 1.0) may lead to unexpected behavior, such as looping or degraded performance"

Your prior experience with temp=0 was likely on Gemini 2.x. Gemini 3's internal dynamic thinking already keeps outputs stable, so temperature=0 buys nothing here, and per the docs it specifically risks output looping (repeating the same token), which would be catastrophic at 80M scale.

---

## Concern 4: Thinking Level

**Decision: `thinking_level: "minimal"` for Gemini 3 Flash**

Reasoning:
- Transcription is a perception + encoding task, not a reasoning/math task
- `minimal` = "matches no thinking for most queries" - perfect for transcription
- Saves output tokens (thinking tokens are billed at $3.00/M for standard, $1.50/M for batch)
- At 80M segments, even 50 fewer thinking tokens per request = 4B fewer billed output tokens
- The prompt already instructs the model HOW to transcribe; it doesn't need to "reason" about it
- For retry segments where initial quality was poor, bump to `low`

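As a concrete sketch, the per-request generation settings can live in one helper. The dict below mirrors the API parameter names used in this plan (`temperature`, `thinking_level`); the key spelling should be adapted to whatever SDK wrapper actually sends the request:

```python
def build_gen_config(retry: bool = False) -> dict:
    """Default generation settings for first-pass transcription.

    Retries of low-quality segments get more thinking budget ("low"
    instead of "minimal"), per the plan above.
    """
    return {
        "temperature": 1.0,                           # Gemini 3 default; do NOT lower
        "thinking_level": "low" if retry else "minimal",
        "response_mime_type": "application/json",     # structured JSON output
    }
```
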
---

## Concern 5: Audio Event Tags

**Decision: 10 tags, confidence-gated insertion**

Final tag list:
- `[laugh]` - most common in podcasts
- `[cough]` - frequent in natural speech
- `[sigh]` - audible exhalation
- `[breath]` - heavy/audible breathing
- `[singing]` - musical vocal segments
- `[noise]` - catch-all for non-speech sounds (mic bumps, clicks, taps)
- `[music]` - background/intro music
- `[applause]` - audience reactions (live recordings)
- `[sniff]` - common in casual speech, distinctly audible
- `[click]` - tongue/mouth clicks, distinct from [noise]

Why these 10:
- YouTube podcasts are informal; laughter, breathing, coughs dominate non-speech events
- `[noise]` is the catch-all for anything not in the other 9 categories
- `[sniff]` and `[click]` are common in natural speech and useful for TTS naturalness
- The prompt explicitly says: "ONLY if clearly and prominently audible" - this prevents hallucinated tags
- The `tagged` field is derived from `transcription` (copy + insert tags), NOT re-interpreted

---

## Concern 6: Speaker Metadata and Accent

**Decision: accent stays OPTIONAL (empty string when not confident)**

- `emotion`, `speaking_style`, `pace` = REQUIRED (these are reliably detectable from audio prosody)
- `accent` = optional, empty string if uncertain. Indian regional accent detection from audio alone is genuinely hard even for native speakers, and forcing it will produce hallucinated labels that poison training data
- The JSON schema already reflects this: `accent` is not in the `required` array
- For accent, better to have 30% coverage with high confidence than 100% coverage with 50% noise

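A minimal sketch of that schema as a raw JSON-schema dict (the shape `response_json_schema` accepts), with `accent` present in `properties` but deliberately absent from `required`. Field names beyond those discussed in this plan are assumptions:

```python
# Response schema sketch: accent is optional, everything else required.
TRANSCRIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "transcription": {"type": "string"},
        "tagged": {"type": "string"},
        "detected_language": {"type": "string"},
        "speaker": {
            "type": "object",
            "properties": {
                "emotion": {"type": "string"},
                "speaking_style": {"type": "string"},
                "pace": {"type": "string"},
                "accent": {"type": "string"},   # empty string when uncertain
            },
            "required": ["emotion", "speaking_style", "pace"],  # accent NOT required
        },
    },
    "required": ["transcription", "tagged", "detected_language", "speaker"],
}
```
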
---

## Concern 7: Audio Segment Preprocessing (Boundary Handling)

**Decision: Energy-based trimming with silence padding**

This is the most critical preprocessing step. The algorithm:

```
1. Load FLAC segment, compute RMS energy in 10ms frames

2. START-OF-SEGMENT CHECK:
   - If first 50ms has RMS below silence_threshold (-40 dBFS):
     -> Clean start. Keep as-is.
   - If first 50ms has RMS ABOVE threshold:
     -> Scan forward for first "silence valley" (>= 50ms consecutive below threshold)
     -> If found within first 40% of segment: trim to that point
     -> If NOT found: mark segment as "abrupt_start" (still transcribe, just flag it)

3. END-OF-SEGMENT CHECK:
   - Mirror of step 2, scanning backwards
   - Same 40% limit from the end

4. SILENCE PADDING:
   - Prepend 150ms of silence (zero samples)
   - Append 150ms of silence
   
5. Why 150ms (not 100ms):
   - Standard TTS silence padding is 100-200ms
   - 150ms gives Gemini a clear "sentence boundary" signal
   - Matches silence duration between natural utterances
   - Any value in 100-200ms range is acceptable
```

Even if this trims 30-40% of segments, quality >> quantity for ASR/TTS training data. A perfectly cut 6s segment is worth more than a messy 10s segment.

Implementation: Use `librosa` for RMS computation and `soundfile` for FLAC I/O. Both are fast and handle FLAC natively.

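A numpy-only sketch of steps 1-4 (in production, `librosa` can supply the RMS frames and `soundfile` the FLAC I/O). Thresholds, the 50ms boundary check, and the 40% scan limit follow the algorithm above:

```python
import numpy as np

FRAME_MS, PAD_MS = 10, 150          # frame size and silence padding from the plan

def _valley_end(silent, limit, min_run=5):
    """Frame index just past the first run of >= min_run silent frames in silent[:limit]."""
    run = 0
    for i in range(min(limit, len(silent))):
        run = run + 1 if silent[i] else 0
        if run >= min_run:
            return i + 1
    return None

def trim_and_pad(audio: np.ndarray, sr: int, silence_dbfs: float = -40.0):
    """Steps 1-4: RMS framing, boundary trimming, 150 ms zero padding."""
    frame_len = max(1, sr * FRAME_MS // 1000)
    n = len(audio) // frame_len
    frames = audio[: n * frame_len].reshape(n, frame_len)
    db = 20 * np.log10(np.sqrt((frames ** 2).mean(axis=1)) + 1e-12)
    silent = db < silence_dbfs
    flags, start_f, end_f = [], 0, n
    if not silent[:5].all():                        # first 50 ms is not clean silence
        v = _valley_end(silent, int(n * 0.4))       # scan first 40% for a valley
        if v is not None:
            start_f = v                             # trim to end of valley
        else:
            flags.append("abrupt_start")            # still transcribe, just flag
    if not silent[-5:].all():                       # mirror check at the end
        v = _valley_end(silent[::-1], int(n * 0.4))
        if v is not None:
            end_f = n - v
        else:
            flags.append("abrupt_end")
    pad = np.zeros(sr * PAD_MS // 1000, dtype=audio.dtype)
    return np.concatenate([pad, audio[start_f * frame_len : end_f * frame_len], pad]), flags
```
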
---

## Concern 8: Segment Length Constraints

**Decision: Min 2s, target 2-10s, hard max 15s**

Re-cutting algorithm for segments > 10s:

```
1. Compute energy profile of the full segment
2. Find all silence valleys (>= 100ms below threshold)
3. Prefer the first valley AFTER 7s mark
4. If no valley between 7-12s: pick the LOWEST energy frame in 10-15s range
5. Split at that point, treat each piece as a new segment
6. Apply min-length filter: discard any resulting piece < 2s
7. Hard cap at 15s: if no split point found by 15s, force-trim at 15s
```

For segments < 2s: DISCARD. These are almost always background noise, acknowledgments ("hmm", "haa"), or cross-talk bleed from diarization errors - exactly the kind of noise you mentioned in your data quality notes.

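A sketch of the split-point search over the per-frame dBFS profile (10 ms frames, so 100 frames per second). Picking the middle of the valley and reusing the -40 dBFS floor from Concern 7 are assumptions:

```python
import numpy as np

FPS = 100                       # 10 ms frames -> 100 frames per second

def find_split_frame(db, floor_dbfs=-40.0, min_valley=10):
    """Pick a split frame for a segment > 10 s, per steps 2-4 above:
    prefer the first >= 100 ms silence valley after the 7 s mark (and
    before 12 s), else fall back to the quietest frame in 10-15 s."""
    run = 0
    for i in range(7 * FPS, min(12 * FPS, len(db))):
        run = run + 1 if db[i] < floor_dbfs else 0
        if run >= min_valley:
            return i - min_valley // 2            # middle of the valley
    lo, hi = 10 * FPS, min(15 * FPS, len(db))
    return lo + int(np.argmin(db[lo:hi]))         # lowest-energy frame fallback
```
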
---

## Concern 9: Transcript Validation Strategy

**Decision: 3-tier validation, implement Tier 1 now, design Tier 2+3 for later**

### Tier 1: Programmatic Checks (run on every segment, instant, free)
- **Empty check**: `transcription` is empty or `[NO_SPEECH]` or `[INAUDIBLE]`
- **Length ratio**: chars_per_second = len(transcription) / audio_duration. Flag if < 2 or > 30 (impossible speech rate)
- **Script check**: use Python `unicodedata` to verify characters match expected script (Telugu chars in Telugu segments, etc.)
- **Language mismatch**: `detected_language != expected_language` -> flag
- **JSON validity**: already enforced by `response_json_schema`, but double-check parse
- **Tag consistency**: `tagged` field should be a superset of `transcription` (same text + tags)
- Store a composite `quality_score` (0-1) per segment

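A sketch of these checks in one function. The 50% script threshold and the equal 0.25 flag weighting are placeholder values to tune on the test batch, and the script map is an excerpt of the full mapping assumed to live in `config.py`:

```python
import re
import unicodedata

TAG_RE = re.compile(r"\[(?:laugh|cough|sigh|breath|singing|noise|music|applause|sniff|click)\] ?")
SCRIPTS = {"te": "TELUGU", "hi": "DEVANAGARI", "ta": "TAMIL"}   # excerpt; full map in config.py

def script_fraction(text: str, script: str) -> float:
    """Fraction of alphabetic chars whose Unicode name contains the script name."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(script in unicodedata.name(c, "") for c in letters) / len(letters)

def tier1(transcription, tagged, duration_s, expected_lang, detected_lang):
    flags = []
    if not transcription.strip() or transcription in ("[NO_SPEECH]", "[INAUDIBLE]"):
        flags.append("empty")
    cps = len(transcription) / max(duration_s, 0.1)
    if not 2 <= cps <= 30:                                      # impossible speech rate
        flags.append("length_ratio")
    script = SCRIPTS.get(expected_lang)
    if script and script_fraction(transcription, script) < 0.5:  # threshold is a guess;
        flags.append("script_mismatch")                          # tune for code-mixed audio
    if detected_lang != expected_lang:
        flags.append("language_mismatch")
    if TAG_RE.sub("", tagged).strip() != transcription.strip():  # tagged = transcription + tags
        flags.append("tag_inconsistency")
    return max(0.0, 1.0 - 0.25 * len(flags)), flags              # naive equal weighting
```
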
### Tier 2: Alignment-Based Scoring (batch post-processing, later)
- **For Indic languages**: Use `ai4bharat/IndicWav2Vec` or `Whisper large-v3` as a cross-reference ASR
  - Run Whisper on the same audio, compare outputs using character error rate (CER) / word error rate (WER)
  - Low WER = high confidence in Gemini's transcript
- **Romanization**: Use `aksharamukha` (Python library) for programmatic script conversion. Supports all 12 of your languages. Convert both Gemini output and reference output to Roman, then compute normalized edit distance
- **MFA (Montreal Forced Aligner)**: Has limited Indic support (Hindi, Bengali). For other Indic languages, use `NeMo` forced aligner or `CTC segmentation` from `wav2letter`
- **Note on MFA speed**: MFA processes ~100x realtime on CPU. For 80M segments of avg 8s = 640M seconds of audio. At 100x realtime, that's 6.4M seconds = 74 days on 1 CPU. Need GPU-accelerated alternatives or only sample-validate.

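The edit-distance comparison in miniature, meant to run on romanized text (e.g. `aksharamukha` output) so Gemini and the reference ASR share an alphabet:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein(ref, hyp) / len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```
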
### Tier 3: Cross-Model Validation (selective, expensive)
- Only for segments flagged by Tier 1 (low quality_score) or Tier 2 (high WER)
- Re-transcribe with Gemini 3 Pro (slower, more expensive, but more accurate)
- If two models agree -> accept; disagree -> manual review queue

**Pragmatic approach**: Ship with Tier 1 only. Store all metrics. Tier 2 runs as an offline batch after the main 80M pipeline completes. Tier 3 is on-demand for the worst segments.

---

## Concern 10: Rate Limits and Throughput Planning

### The math:

**Target**: 80M segments / 100 hours = 222 segments/second sustained

**Per-segment token estimate** (need to verify with test batch):
- Audio: ~8s avg at ~25 tokens/s = ~200 audio tokens
- System prompt: ~400 text tokens (cached after first request)
- User prompt: ~30 text tokens
- Output: ~200 text tokens (JSON response)
- Total per request: ~830 tokens

### AI Studio Realtime (20K RPM, 20M TPM):
- Token-limited: 20M / 830 = ~24K RPM -> capped at 20K RPM = 333 RPS
- Over 100h: 333 * 3600 * 100 = **~120M segments** (capacity exceeds need)
- BUT: practical throughput with network latency + async overhead is ~60-70% of that = ~72-84M segments, leaving little margin over the 80M target
- With prompt caching (system instruction cached), effective token count drops to ~430 new tokens/request -> TPM headroom doubles

### AI Studio/Vertex Batch API:
- 50% cost savings
- 24h SLO (usually faster)
- Submit via JSONL files (up to 2GB each)
- At ~50KB per segment (FLAC base64 encoded): ~40K segments per JSONL file
- 80M segments / 40K = 2000 JSONL files
- Can submit in waves: 500 batch jobs/day over 4 days

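The sizing arithmetic above, ready to rerun once real per-segment payload sizes are measured:

```python
SEGMENTS = 80_000_000
SEG_BYTES = 50_000                      # ~50 KB per base64 FLAC request line (plan's estimate)
JSONL_LIMIT = 2_000_000_000             # 2 GB per JSONL file

segs_per_file = JSONL_LIMIT // SEG_BYTES        # 40_000 segments/file
n_files = -(-SEGMENTS // segs_per_file)         # ceiling division -> 2_000 files
jobs_per_day = n_files // 4                     # 500 batch jobs/day over 4 days
```
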
### OpenRouter (fallback only):
- Gemini 3 Flash via OpenRouter at standard pricing
- Use OpenAI-compatible API with `response_format: { type: "json_schema" }`
- Activate only if AI Studio rate limits are saturated or Batch API queues are backed up

---

## Concern 11: Cost Estimation

**Gemini 3 Flash Batch pricing** (per 1M tokens):
- Audio input: $0.50
- Text input: $0.25
- Output: $1.50

**Per segment** (system prompt served from the cache, so only ~30 new text tokens are billed at the full input rate):
- Audio input: 200 tokens * $0.50/1M = $0.000100
- Text input: 30 tokens * $0.25/1M = $0.0000075 (system prompt cached)
- Output: 200 tokens * $1.50/1M = $0.000300
- **Per segment total: ~$0.0004**

**80M segments * $0.0004 = ~$32,000 for batch**

For realtime (double the batch price): ~$64,000 if all done via realtime.

**Note**: Run a 1000-segment test batch FIRST to measure actual token counts. The audio tokenization rate is critical and varies by codec/bitrate.

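The same arithmetic as code, to rerun with measured token counts from the test batch:

```python
AUDIO_IN, TEXT_IN, OUTPUT = 0.50, 0.25, 1.50      # $ per 1M tokens, batch tier

def segment_cost(audio_tok=200, text_tok=30, out_tok=200) -> float:
    """Dollar cost of one segment at the estimated token counts."""
    return (audio_tok * AUDIO_IN + text_tok * TEXT_IN + out_tok * OUTPUT) / 1e6

total = 80_000_000 * segment_cost()               # ~$32.6K at these estimates
```
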
---

## Concern 12: API Backend Strategy

**Priority order**:

```
1. AI Studio Batch API (PRIMARY - 50% cost, async, bulk)
   - Submit JSONL with inline audio base64
   - Poll for results every 5 minutes
   - Process results as they complete
   
2. AI Studio Realtime API (SECONDARY - retries, urgent)
   - Async with asyncio + semaphore (200 concurrent)
   - Used for: failed batch segments, validation retries, testing
   
3. OpenRouter (TERTIARY - overflow only)
   - OpenAI-compatible API
   - Only if: Day 3+ and behind schedule
   - Supports Gemini 3 Flash with structured outputs
```

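A sketch of the realtime retry wrapper from backend 2; `call` is a placeholder for whatever coroutine wraps the actual Gemini request:

```python
import asyncio
import random

async def with_retries(call, payload, sem, retries=5, base=1.0):
    """Run call(payload) under a shared semaphore (200 permits in production),
    retrying on failure with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            async with sem:
                return await call(payload)
        except Exception:
            if attempt == retries - 1:
                raise                                   # retries exhausted: surface the error
            await asyncio.sleep(base * 2 ** attempt + random.random() * 0.1)
```
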
---

## Concern 13: Prompt Caching

**Strategy: Cache the system prompt per language**

- System prompt is ~400 tokens (same for all segments of the same language)
- Gemini 3 Flash implicit caching requires a minimum prefix of 1024 tokens; the ~400-token system prompt alone is below this, so cache hits depend on the full request prefix and should be verified in the test batch
- For batch: include system_instruction in each request config. Gemini Batch API supports context caching internally
- For realtime: structure requests so system_instruction is consistent, triggering implicit cache hits
- The savings compound: at 80M requests, even a 50% cache hit rate on a 400-token system prompt saves ~16B input tokens

---

## Concern 14: Prompt Refinement

The current [prompt.txt](prompt.txt) is solid but needs these changes:

1. **Parameterize language**: Replace hardcoded "Telugu (te-IN)" with `{language} ({lang_code})`
2. **Remove temperature guidance from prompt**: Temperature is an API parameter, not prompt text
3. **Strengthen tagged field instruction**: "Copy `transcription` verbatim, then insert tags. Do NOT re-listen or re-interpret."
4. **Add boundary handling note**: "Audio may start/end mid-speech. Transcribe only what you can confidently hear. If first/last word is cut off, omit it."
5. **Add script mapping table**: For each language, specify its script name (Telugu -> Telugu script, Hindi -> Devanagari, etc.)

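A sketch of the parameterized builder implementing points 1, 4, and 5. The template wording and the language map excerpt are illustrative, with the full 12-language map assumed to live in `config.py`:

```python
LANGS = {                        # excerpt; full map lives in config.py
    "te-IN": ("Telugu", "Telugu script"),
    "hi-IN": ("Hindi", "Devanagari"),
    "ta-IN": ("Tamil", "Tamil script"),
}

SYSTEM_TEMPLATE = (
    "Expected language: {language} ({lang_code}). This is a hint, not a constraint. "
    "Trust what you hear.\n"
    "Write {language} words in {script}. Keep English words in Latin script.\n"
    "Audio may start/end mid-speech. Transcribe only what you can confidently hear; "
    "if the first or last word is cut off, omit it.\n"
    "For the tagged field: copy transcription verbatim, then insert tags. "
    "Do NOT re-listen or re-interpret."
)

def build_system_prompt(lang_code: str) -> str:
    """One cached system prompt per language, filled from the script mapping."""
    language, script = LANGS[lang_code]
    return SYSTEM_TEMPLATE.format(language=language, lang_code=lang_code, script=script)
```
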
---

## Concern 15: Data Flow Architecture

```mermaid
flowchart TD
    subgraph data_source [Data Source]
        R2[R2: videoID.tar]
        SB[Supabase: video metadata]
    end

    subgraph preprocessing [Preprocessing Worker]
        DL[Download + Extract tar]
        META[Read metadata.json]
        TRIM[Trim boundaries + pad silence]
        FILTER[Filter: 2s to 15s]
        SPLIT[Re-split segments over 10s]
        ENCODE[Encode to base64]
    end

    subgraph batch_prep [Batch Preparation]
        PROMPT[Build prompt per language]
        JSONL[Create JSONL files with 40K segments each]
        UPLOAD[Upload JSONL to File API]
    end

    subgraph inference [Inference Backends]
        BATCH[AI Studio Batch API]
        REALTIME[AI Studio Realtime API]
        OR[OpenRouter API]
    end

    subgraph postprocessing [Post-Processing]
        PARSE[Parse JSON responses]
        VALIDATE[Tier 1 validation + scoring]
        STORE[Store in Supabase]
        FLAG[Flag low-confidence segments]
    end

    R2 --> DL
    SB --> META
    DL --> META
    META --> TRIM
    TRIM --> FILTER
    FILTER --> SPLIT
    SPLIT --> ENCODE
    ENCODE --> PROMPT
    PROMPT --> JSONL
    JSONL --> UPLOAD
    UPLOAD --> BATCH
    ENCODE --> REALTIME
    ENCODE --> OR
    BATCH --> PARSE
    REALTIME --> PARSE
    OR --> PARSE
    PARSE --> VALIDATE
    VALIDATE --> STORE
    VALIDATE --> FLAG
    FLAG --> REALTIME
```

---

## Concern 16: Supabase Schema

New tables needed:

- **`transcription_results`**: video_id, segment_file, transcription, tagged, speaker_json, detected_language, quality_score, language_mismatch (bool), api_backend (batch/realtime/openrouter), token_usage, created_at
- **`transcription_jobs`**: job_id, batch_name, status, segment_count, completed_count, failed_count, created_at, completed_at
- **`transcription_flags`**: segment_id, flag_type (low_score/language_mismatch/empty/etc), resolved (bool)

---

## Implementation Modules

The pipeline will be built as a Python package with these modules:

- `config.py` - env vars, constants, language mappings
- `r2_client.py` - R2 download/extract
- `audio_preprocess.py` - trim, pad, length filter, re-split
- `prompt_builder.py` - parameterized prompt per language
- `gemini_batch.py` - JSONL creation, batch submission, polling
- `gemini_realtime.py` - async realtime API calls
- `openrouter_client.py` - OpenRouter fallback
- `validator.py` - Tier 1 validation + scoring
- `db.py` - Supabase read/write
- `pipeline.py` - orchestrator tying everything together
- `main.py` - CLI entry point
