# Gemini Transcription Pipeline

Distributed audio transcription system using **Gemini 3 Flash** via Google AI Studio. Processes 500K+ multilingual YouTube videos (~105M audio segments) across 12 Indic languages. Workers claim videos from a Supabase queue, download audio tars from Cloudflare R2, polish, split, and transcribe the segments, validate the results, and upload output tars back to R2.

## Architecture

```
Supabase (video_queue)  ──claim──>  Worker (Docker container)
                                       │
                                       ├── R2: download {id}.tar
                                       ├── Audio Polish: split >10s, trim edges, pad silence
                                       ├── Gemini 3 Flash: transcribe all segments (batch, async)
                                       ├── Validator: quality score, lane flags
                                       ├── Supabase: insert transcription_results + flags
                                       ├── R2: upload {id}_transcribed.tar
                                       └── Supabase: mark video done
```

Each worker is a single Docker container running `python -m src.main`. Workers are stateless — all state lives in Supabase and R2.
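
The claim/process/mark-done cycle can be sketched as follows; `db` and `pipeline` here are hypothetical stand-ins for the interfaces of `db.py` and `pipeline.py`, not their real signatures:

```python
import asyncio

async def worker_loop(db, pipeline, worker_id: str, max_videos: int = 0) -> None:
    """Minimal stateless worker loop: claim a video, process it, mark done/failed."""
    done = 0
    while max_videos == 0 or done < max_videos:
        video = await db.claim_next(worker_id)
        if video is None:
            await asyncio.sleep(30)  # queue drained; poll again later
            continue
        try:
            # download -> polish -> transcribe -> pack -> upload
            await pipeline.process(video)
            await db.mark_done(video["video_id"])
        except Exception as exc:
            await db.mark_failed(video["video_id"], str(exc))
        done += 1
```

Because all state lives in Supabase and R2, a crashed worker loses nothing: its claimed video can be re-queued and picked up by any other worker.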

## Project Structure

```
src/
  main.py              Entry point (Docker ENTRYPOINT)
  config.py            All tunables, env vars, language mappings
  worker.py            Worker lifecycle: register, heartbeat, main loop, prefetch, shutdown
  pipeline.py          Per-video orchestrator: download → polish → transcribe → pack → upload
  batch_cycle.py       Batch engine: fire requests, handle 429 flood, validate, build DB records
  cache_manager.py     Gemini explicit context caching (V2 prompt, 6-day TTL)
  audio_polish.py      4-step audio pipeline: split → boundary-trim → silence-pad → FLAC encode
  prompt_builder.py    V1/V2 system prompts, Pydantic JSON schema, user prompt templates
  validator.py         Tier 1 quality validation: chars/sec, script check, lang mismatch, etc.
  db.py                PostgreSQL client via asyncpg (+ MockDB for testing)
  r2_client.py         Cloudflare R2 client (download/extract/pack/upload tars)
  providers/
    base.py            Abstract provider interface + data classes
    aistudio.py        PRIMARY: direct Gemini REST API, V2 cache, 20s timeout, thinking override
    openrouter.py      SECONDARY: OpenRouter (kept for preflight tests, not used in production)

tests/                 73 unit/integration tests (pytest)
preflight/             Live validation scripts (canary, stress, determinism, dashboard)
```

## Gemini Keys (Multi-Key Pool)

Four GCP project API keys with identical quotas. Each key: **20K RPM, 20M TPM, unlimited RPD**.

```
GEMINI_KEY       = primary (GCP project 1)
GEMINI_PROJECT2  = secondary (GCP project 2)
GEMINI_PROJECT3  = tertiary (GCP project 3)
GEMINI_PROJECT4  = quaternary (GCP project 4)
```

**Combined capacity: 80K RPM, 80M TPM.**

At deployment, each worker is assigned a key via `GEMINI_KEY_INDEX=0|1|2|3`; the next key in the pool serves as its 429 fallback. Each key gets its own explicit V2 context cache (auto-created on startup, 6-day TTL, 1036 tokens).
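
A minimal sketch of the key-index to primary/fallback mapping (env var names from the table below; the helper itself is illustrative):

```python
# Ordered key pool: GEMINI_KEY_INDEX selects the primary key;
# the next key (wrapping around) is the 429 fallback.
KEY_ENV_VARS = ["GEMINI_KEY", "GEMINI_PROJECT2", "GEMINI_PROJECT3", "GEMINI_PROJECT4"]

def primary_and_fallback(key_index: int) -> tuple[str, str]:
    """Return (primary, fallback) env var names for a worker's key index."""
    return (KEY_ENV_VARS[key_index],
            KEY_ENV_VARS[(key_index + 1) % len(KEY_ENV_VARS)])
```

Round-robin assignment of `GEMINI_KEY_INDEX` across workers keeps per-key load balanced while the wraparound fallback gives every worker a second quota pool.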

## Environment Variables

### Required

| Variable | Description |
|----------|-------------|
| `GEMINI_KEY` | Primary Gemini API key |
| `GEMINI_PROJECT2` | Second Gemini API key |
| `GEMINI_PROJECT3` | Third Gemini API key |
| `GEMINI_PROJECT4` | Fourth Gemini API key |
| `DATABASE_URL` | PostgreSQL connection string (direct PG, bypasses REST) |
| `URL` | Supabase REST API URL (dashboard only) |
| `SUPABASE_ADMIN` | Supabase service_role JWT (dashboard only) |
| `R2_ENDPOINT_URL` | Cloudflare R2 S3-compatible endpoint |
| `R2_ACCESS_KEY_ID` | R2 access key |
| `R2_SECRET_ACCESS_KEY` | R2 secret key |
| `R2_BUCKET` | R2 bucket name (default: `1-cleaned-data`) |

### Deployment Controls

| Variable | Default | Description |
|----------|---------|-------------|
| `GEMINI_KEY_INDEX` | `0` | Which key this worker uses as primary (0, 1, 2, or 3) |
| `WORKER_ID` | auto UUID | Unique worker identifier |
| `GPU_TYPE` | `unknown` | Label for Supabase dashboard |
| `MAX_VIDEOS` | `0` | Stop after N videos (0 = unlimited, for test runs) |
| `WORKER_BATCH_SIZE` | `1000` | Max segments per API batch |
| `BATCH_INTERVAL_SECONDS` | `60` | Min seconds between batches (rate limiting) |
| `TEMPERATURE` | `0` | Gemini temperature |
| `THINKING_LEVEL` | `low` | Gemini thinking level (LOW/MINIMAL) |
| `MOCK_MODE` | `false` | Run without real APIs |

## Supabase Schema

### video_queue
Claim queue for workers. 507K videos seeded across 12 languages.

| Column | Type | Description |
|--------|------|-------------|
| video_id | text PK | YouTube video ID |
| status | text | pending / claimed / done / failed |
| language | text | ISO 639-1 code (ta, hi, te, etc.) |
| segment_count | int | Number of raw segments in tar |
| claimed_by | text | Worker ID that claimed it |
| claimed_at | timestamptz | When claimed |
| completed_at | timestamptz | When finished |
| error_message | text | Error details if failed |

### workers
Live worker status and stats. Updated via heartbeat every 10s.

| Column | Type | Description |
|--------|------|-------------|
| worker_id | text PK | Unique worker ID |
| status | text | online / offline / error |
| provider | text | Always "aistudio" |
| gpu_type | text | GPU label |
| config_json | jsonb | Prompt/schema versions, temperature, thinking level |
| total_segments_sent/completed/failed/429 | int | Cumulative counts |
| total_cache_hits | int | Segments served from V2 cache |
| total_input_tokens / output_tokens / cached_tokens | int | For cost tracking |
| batches_completed | int | Total batch count |
| current_video_id | text | Video being processed |
| segments_remaining | int | In current video |
| active_rpm / active_tpm | float | Live throughput |

### transcription_results
One row per transcribed segment.

| Column | Type | Description |
|--------|------|-------------|
| id | uuid PK | |
| video_id | text | Source video |
| segment_file | text | Segment filename |
| expected_language_hint | text | Language from queue |
| detected_language | text | Language model actually heard |
| lang_mismatch_flag | bool | expected != detected |
| transcription | text | Verbatim native-script transcription |
| tagged | text | Same with event tags inserted |
| speaker_emotion / style / pace / accent | text | Prosody metadata |
| quality_score | float | 0-1 composite quality |
| asr_eligible / tts_clean_eligible / tts_expressive_eligible | bool | Downstream lane flags |
| prompt_version / schema_version / model_id / temperature | text | Reproducibility metadata |
| cache_hit | bool | Whether V2 cache was used |
| token_usage_json | jsonb | input/output/cached token counts |

### transcription_flags
Quality flags for flagged segments (errors, timeouts, validation issues).

## Audio Polish Pipeline

4-step process defined in `audio_polish.py`:

1. **Length split**: Segments >10s split at silence valleys (search from 7s). Hard upper bound 15s.
2. **Boundary trim**: Check first/last 50ms RMS, trim dirty edges at nearest silence valley.
3. **Silence pad**: 150ms silence prepended and appended.
4. **FLAC encode**: Convert to FLAC bytes + base64 for API.

Segments <2s are discarded at any stage. All valid segments are 2s–15s, most ≤10s.
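
The silence-valley search in step 1 might look like this sketch (pure NumPy; the 50 ms window and the exact search strategy are assumptions, not the real `audio_polish.py` implementation):

```python
import numpy as np

def find_split_point(samples: np.ndarray, sr: int,
                     search_from_s: float = 7.0, hard_max_s: float = 15.0,
                     win_ms: int = 50) -> int:
    """Return the sample index of the quietest window between 7 s and the
    15 s hard upper bound: a stand-in for the 'silence valley' split."""
    start = int(search_from_s * sr)
    end = min(int(hard_max_s * sr), len(samples))
    win = int(sr * win_ms / 1000)
    best_idx, best_rms = start, float("inf")
    for i in range(start, end - win, win):
        rms = float(np.sqrt(np.mean(samples[i:i + win] ** 2)))
        if rms < best_rms:
            best_rms, best_idx = rms, i
    return best_idx
```

Splitting at an energy minimum rather than a fixed 10 s mark avoids cutting mid-word, which matters for verbatim transcription quality downstream.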

## Prompt System

**V2 (production)**: Language-agnostic cacheable system prompt (~1036 tokens) with:
- Verbatim transcription rules (no translation, no correction, no hallucination)
- Code-mixed handling (each language in its native script)
- Prosody-based punctuation only
- 6 reference examples (Telugu+English code-mix, Hindi+English, no-speech, abrupt cutoff, event tags, language mismatch)
- 10 audio event tags
- Accent guidance: sub-regional dialect only, not language name

Per-request user prompt carries the language hint: `TARGET LANGUAGE: Tamil (ta)`.

JSON schema enforced via API `responseJsonSchema` (not in prompt text).
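
A hedged sketch of such a request body; `cachedContent` and `responseJsonSchema` are named by this section, but the surrounding field layout is an assumption based on the public `generateContent` shape:

```python
def build_request(audio_b64: str, language_name: str, code: str,
                  cache_name: str, schema: dict) -> dict:
    """Assemble a per-segment request: cached V2 system prompt + language
    hint + inline FLAC audio + enforced JSON response schema."""
    return {
        "cachedContent": cache_name,  # explicit V2 system-prompt cache
        "contents": [{
            "role": "user",
            "parts": [
                {"text": f"TARGET LANGUAGE: {language_name} ({code})"},
                {"inlineData": {"mimeType": "audio/flac", "data": audio_b64}},
            ],
        }],
        "generationConfig": {
            "temperature": 0,
            "responseMimeType": "application/json",
            "responseJsonSchema": schema,
        },
    }
```

Keeping the schema out of the prompt text means the cached system prompt stays byte-identical across all 12 languages, which is what makes the single shared V2 cache possible.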

## Timeout & Retry Strategy

| Scenario | Behavior |
|----------|----------|
| Normal response | 2-5s latency |
| Read timeout (20s) | Retry once with `thinking=MINIMAL` (avoids re-entering thinking loop) |
| Second timeout (20s) | Discard segment as bad sample. Max 40s waste. |
| 429 rate limit | Exponential backoff: 10s, 20s, 40s (3 retries) |
| 429 flood (>10% of batch) | Entire batch of 429'd segments retried on fallback key |
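
The two timeout rows above amount to a two-strike policy; a sketch (the `call` coroutine and its `thinking` keyword are hypothetical stand-ins for the provider client):

```python
import asyncio

async def transcribe_with_retry(call, segment, timeout_s: float = 20.0):
    """First attempt at thinking=low; on timeout, retry once at minimal;
    on a second timeout, discard the segment (max ~2x timeout wasted)."""
    for thinking in ("low", "minimal"):
        try:
            return await asyncio.wait_for(call(segment, thinking=thinking),
                                          timeout=timeout_s)
        except asyncio.TimeoutError:
            continue
    return None  # discarded as a bad sample
```

Dropping to `thinking=MINIMAL` on the retry is the key trick: a segment that stalled the model once usually stalls it again at the same thinking level.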

## R2 Storage Layout

```
Input:   s3://1-cleaned-data/{video_id}.tar
         s3://1-cleaned-data/cleaned/trail/{video_id}.tar  (alternate prefix)

Output:  s3://1-cleaned-data/transcribed/{video_id}_transcribed.tar
```

Output tar structure:
```
{video_id}/
  metadata.json           Original metadata + transcription_summary
  segments/               Polished FLAC audio (valid segments only)
  transcriptions/         1:1 JSON per segment (transcription, tagged, speaker, detected_language)
```

## Docker

```bash
# Build
docker build -t bharathkumar192/transcription-worker:v8 .

# Run tests inside container
docker run --rm --entrypoint python bharathkumar192/transcription-worker:v8 -m pytest tests/ -x -q

# Test run (2 videos, single key)
docker run --rm --env-file .env \
  -e MAX_VIDEOS=2 \
  -e GEMINI_KEY_INDEX=0 \
  -e WORKER_ID=test-worker \
  bharathkumar192/transcription-worker:v8

# Production (unlimited videos)
docker run --rm --env-file .env \
  -e GEMINI_KEY_INDEX=0 \
  -e WORKER_ID=worker-001 \
  -e GPU_TYPE=RTX_4090 \
  bharathkumar192/transcription-worker:v8
```

CLI args: `--mock`, `--worker-id ID`, `--gpu-type TYPE`, `--key-index 0|1|2|3`, `--max-videos N`

## vast.ai Deployment

**Target**: 50 workers across 4 keys (12-13 workers per key).

**Docker image**: `bharathkumar192/transcription-worker:v8`

Per worker: assign `GEMINI_KEY_INDEX` in round-robin (0,1,2,3,0,1,2,3,...), unique `WORKER_ID`, set `GPU_TYPE` to the instance GPU.

```bash
# Deploy on vast.ai (RTX 4090/3090, 4+ vCPUs)
vastai create instance <offer_id> \
  --image bharathkumar192/transcription-worker:v8 \
  --disk 50 --ssh --direct \
  --login '-u bharathkumar192 -p <DOCKER_PAT> docker.io' \
  --env "-e R2_ENDPOINT_URL=... -e DATABASE_URL=... -e GEMINI_KEY_INDEX=0 -e WORKER_ID=v8-prod-001 -e GPU_TYPE=RTX_4090 ..." \
  --onstart-cmd 'cd /app && python -m src.main >> /var/log/worker.log 2>&1 &' \
  --label "v8-prod-001"
```

**Capacity math**:
- 4 keys x 20K RPM = 80K RPM total
- 50 workers x ~600 RPM avg = ~30K RPM (well within 80K limit)
- 507K videos in queue, ~207 segments avg = ~105M segments total

**Live vast.ai v8 deployment** (2026-03-01): 48 RTX 3090/4090 workers, global
- 179K segments processed in first 15 min: 0 429s, 0 failures, 100% cache hits
- Combined RPM: ~31K, ETA ~2 days for remaining 443K videos
- Direct PostgreSQL via asyncpg: zero DB timeouts (was the v4-v7 killer)

**v4→v8 fix history**:
- v4: parallelized audio polish + async IO (2x throughput), but bulk inserts hit Supabase statement timeout
- v5-v6: chunked inserts, reduced chunk size — still overwhelmed PostgREST at 45+ workers
- v7: retry on register/heartbeat — PostgREST REST API (HTTP 000 timeout) buckled under load
- **v8: replaced Supabase REST entirely with direct PostgreSQL (asyncpg)**. Atomic claims via FOR UPDATE SKIP LOCKED, connection pooling, executemany batch inserts, heartbeat jitter. Zero DB issues at scale.
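
A sketch of the v8 atomic claim; table and column names come from the schema above, but the exact SQL and the asyncpg plumbing around it are illustrative:

```python
# FOR UPDATE SKIP LOCKED lets 50 workers claim concurrently: a row locked
# by one worker's claim is silently skipped by everyone else, so no two
# workers ever claim the same video and nobody blocks waiting on a lock.
CLAIM_SQL = """
UPDATE video_queue
   SET status = 'claimed', claimed_by = $1, claimed_at = now()
 WHERE video_id = (
       SELECT video_id FROM video_queue
        WHERE status = 'pending'
        ORDER BY video_id
        FOR UPDATE SKIP LOCKED
        LIMIT 1)
RETURNING video_id, language, segment_count
"""

async def claim_next(pool, worker_id: str):
    """Atomically claim one pending video; returns None when the queue is empty."""
    async with pool.acquire() as conn:
        return await conn.fetchrow(CLAIM_SQL, worker_id)
```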

**Monitoring**: Direct PostgreSQL queries on `workers` and `video_queue` tables. Workers log `[GEMINI-STATS]` (429/500/timeout counters) and `[DB-STATS]` (insert/claim/heartbeat counters) periodically.

## Test Results (Verified)

| Test | Result |
|------|--------|
| Unit tests | 73/73 pass (inside Docker container) |
| 12-language determinism (AI Studio) | 90% exact match across 3 runs |
| 500-concurrency stress | 500/500 success, 0 errors, 0 429s, 100% cache hits |
| 1000-segment stress | 1000/1000 success, 74.9 req/s |
| Multi-key consistency (3 GCP keys) | All 3 keys: 80-90% determinism, identical quality |
| Live 2-video Docker E2E | 419/420 segments (99.8%), 62s wall time, 100% cache hits |
| Prefetch | Verified: video 2 tar pre-downloaded during video 1 processing, zero duplicate downloads |
| CAS claim lock | Verified: no race conditions in Supabase claim |
| Vast.ai RTX 4090 E2E | 685/685 segments (100%), 909 RPM, 0 failures, 0 429s |

## Video Queue Distribution

| Language | Code | Videos |
|----------|------|--------|
| Telugu | te | 71K |
| Malayalam | ml | 65K |
| English | en | 60K |
| Hindi | hi | 59K |
| Punjabi | pa | 56K |
| Tamil | ta | 50K |
| Kannada | kn | 46K |
| Gujarati | gu | 41K |
| Bengali | bn | 20K |
| Odia | or | 19K |
| Marathi | mr | 16K |
| Assamese | as | 4K |
| **Total** | | **507K** |

## Cost Tracking

Token usage is tracked per-worker in Supabase (`total_input_tokens`, `total_output_tokens`, `total_cached_tokens`). Cost formula:

```
cost = (input_tokens - cached_tokens) * input_rate
     + cached_tokens * input_rate * 0.25
     + output_tokens * output_rate
```

Gemini 3 Flash pricing applies. With V2 caching (~84% cache hit rate by token count), input costs are significantly reduced.
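
The formula as a small helper (rates are per-token; cached input bills at 25% per the formula above):

```python
def gemini_cost(input_tokens: int, output_tokens: int, cached_tokens: int,
                input_rate: float, output_rate: float) -> float:
    """Cost of a request batch: uncached input at full rate, cached input
    at a 75% discount, output at the output rate."""
    return ((input_tokens - cached_tokens) * input_rate
            + cached_tokens * input_rate * 0.25
            + output_tokens * output_rate)
```

With the ~84% cache hit rate by token count, roughly five-sixths of input tokens bill at the quarter rate.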

## Live Dashboard

**URL**: https://dashboard-theta-lovat-45.vercel.app

Next.js dashboard deployed on Vercel. Auto-refreshes every 15s. Pulls live data from all 4 Supabase tables:

- **Aggregate overview**: Videos done/pending/failed, segments processed, tokens used, estimated cost, ETA
- **Worker heartbeats**: Status grid with live RPM, segments completed, GPU type, heartbeat staleness
- **Per-worker detail**: Click any worker for full breakdown — segments, cache hits, tokens, cost, config, errors
- **Language progress**: Per-language progress bars across all 12 Indic languages (507K videos)
- **Queue chart**: Pie chart of pending/claimed/done/failed
- **Errors & flags**: Failed videos with error messages, flag type breakdown

Source: `dashboard/` directory. Environment: `NEXT_PUBLIC_SUPABASE_URL` + `SUPABASE_SERVICE_KEY`.

## Validation Pipeline (LID + CTC Scoring)

**Status**: Fleet-ready. Docker image + Vast.ai deploy script complete.

Multi-model validation that runs every segment through 4 models to produce rich per-segment quality data:

1. **MMS LID-256** (Meta, 1B params): Language classification — 256 languages, ~21 segs/s on A100
2. **VoxLingua107** (SpeechBrain, 14M params): Second LID opinion + 256-dim speaker embeddings, ~78 segs/s
3. **IndicConformer 600M** (AI4Bharat, multilingual): CTC ASR on 22 Indic languages + CER scoring vs Gemini, ~2.4 segs/s
4. **English CTC** (facebook/wav2vec2-large, 315M params): English CTC log-likelihood scoring, ~18 segs/s

**Output**: Parquet shards (50 videos each) with 27 columns per segment — all model scores, confidences, top-3 predictions, speaker embeddings, transcription comparison, LID consensus. Designed for offline bucketing into Golden/Redo/Discard sets.

```bash
# Test on a single video (downloads from R2, runs all models):
python -m validations.main --test-local="VIDEO_ID" --models all

# Run worker loop (claims from DB, processes, packs parquet):
python -m validations.main --max-videos 10
```

**VRAM**: 4.0GB for all 4 models on A100 (fits easily on 3090 24GB).
**Throughput**: ~2 segs/s total (conformer bottleneck), ~50 segments per video ≈ 25s/video.

Source: `validations/` directory.

### Validation Supabase Schema

Uses a separate `validation_status` column on `video_queue` (keeps transcription `status='done'` untouched):
- States: `pending` → `validating` → `validated` / `validation_failed`
- Claim: `WHERE status = 'done' AND validation_status = 'pending' ORDER BY video_id FOR UPDATE SKIP LOCKED`
- Index: `idx_vq_val_claim` partial index on `(video_id) WHERE validation_status='pending' AND status='done'`
  - ORDER BY video_id forces planner to use the partial index (105x faster than seq scan)
- Dashboard indexes: `idx_vq_status_lang (status, language)`, `idx_vq_done_completed (completed_at) WHERE status='done'`

`worker_validators` table tracks live fleet: worker_id, status, gpu_type, videos/segments processed, avg throughput, shards written, heartbeat, last_error. Dashboard can query for stale workers (heartbeat > 2min ago).

### Validation Docker

```bash
# Build (models baked in, ~10GB image)
docker build -f Dockerfile.validation -t bharathkumar192/validation-worker:latest \
  --build-arg HF_TOKEN=$HF_TOKEN .

# Test locally
docker run --rm --gpus all --env-file .env \
  -e MAX_VIDEOS=2 -e WORKER_ID=test-val \
  bharathkumar192/validation-worker:latest

# Push to Docker Hub
docker push bharathkumar192/validation-worker:latest
```

### Validation Vast.ai Deployment

```bash
# Deploy 50 workers
source .env && ./deploy_validation.sh 50

# Dry run (preview)
source .env && ./deploy_validation.sh 10 --dry-run

# Monitor
vastai show instances
```

**Capacity math (150 GPUs, post DB-optimization)**:
- ~2 segs/s per GPU, ~50 segments/video ≈ 25s/video
- 150 GPUs: ~6 videos/s → 465K remaining in ~21 hours
- DB fixes (2026-03-04): claim query 15.8ms→0.15ms (105x), heartbeat 30s→60s, pool 6→3 conns/worker

**R2 buckets**: Source=`transcribed` (507K tars), Output=`validation-results` (parquet shards).

## Recover Validation Pipeline (2026-03-07)

### Why

The original validation pipeline (above) processed ~507K videos through 4 GPU models (MMS, VoxLingua, Conformer, Wav2Vec) to produce per-segment quality scores. After that run completed, we discovered that `audio_polish` (the segment splitting/trimming step) is **non-deterministic** — it produces different child segment IDs on different machines due to floating-point boundary detection. This meant the validation parquet shards referenced segment IDs that didn't match the original transcription DB rows, making downstream join/bucketing unreliable.

### What We Recovered

Instead of re-running 507K videos from scratch, we built a **recover pipeline** that:

1. **Replays `audio_polish` deterministically** on each video's raw tar to regenerate the exact child segment IDs
2. **Matches replayed IDs to historical `transcription_results`** rows in Supabase to find which segments already had transcriptions
3. **Validates only the segments that need it** — segments already validated in the prior run get their results replayed from R2-hosted parquet snapshots (no GPU work), while gap segments (new IDs from the deterministic replay) get full 4-model GPU validation
4. **Uploads one parquet shard per video** to R2 with all 27 validation columns

This approach preserved ~60% of prior GPU work via replay while filling the gaps caused by non-deterministic segment IDs.

### Reference Data

To avoid hitting Supabase for every video's `transcription_results` and `transcription_flags`, workers download dated parquet snapshots from R2 once at startup and query them locally via DuckDB:

- `reference-data/20260307/transcription_results_recover.parquet` — all historical transcription rows
- `reference-data/20260307/transcription_flags_recover.parquet` — all historical flag rows  
- `reference-data/20260307/validated_segment_ids.parquet` — segment IDs already validated in prior run
- `reference-data/20260307/recover_reference_manifest.json` — row counts for verification

Source: `validations/recover_reference_store.py` (DuckDB-backed local store with multipart S3 download).

### Fleet & Execution

- **Queue**: `validation_recover_queue_20260307` — 497,075 videos
- **Image**: `bharathkumar192/validation-worker:recover-20260307-ledger`
- **Deploy script**: `scripts/deploy_recover_validation.sh`
- **Scaled to 300 GPUs** on Vast.ai across RTX 4090, 3090, 3090 Ti, 4090D, A6000, A5000, 4080
- Multiple scale-up waves (50 → 100 → 200 → 300) with continuous dead-worker detection and replacement
- Fleet completed the full queue in approximately **10 hours wall-clock**

### Final Results

| Metric | Count |
|--------|-------|
| Videos recovered | **497,068** (99.999%) |
| Videos failed | **7** |
| Total tx segments processed | **76,707,278** |
| Newly validated segments (GPU work) | **36,944,013** |
| Replayed segments (from prior run) | **76,772,363** |
| R2 parquet shards | **435,231** |
| R2 total output size | **47.82 GB** |
| Workers that contributed | **305** |

### Output Location

All output parquet shards are in R2 bucket `validationsrecoverfinal` under:
```
s3://validationsrecoverfinal/shards/{worker_id}/validation_{worker_id}_shard_NNNN.parquet
```

Each shard contains one video's worth of segments with the full 27-column validation schema (same as original validation pipeline — see `validations/packer.py` for schema).

### Recover-Specific Code

```
validations/
  recover_worker.py          Recover worker: replay → match → validate gaps → flush shard
  recover_loader.py          Deterministic audio_polish replay + segment ID matching
  recover_reference_store.py DuckDB-backed R2 parquet snapshot store
  recover_replay_ledger.py   Per-video ledger sidecar (tracks what was replayed vs regenerated)

scripts/
  deploy_recover_validation.sh  Vast.ai deploy script with GPU-specific batch size tuning
```

## Validation Audit UI

A local utility for inspecting low-performing videos segment by segment: replay audio polishing, view transcriptions, and compare validation scores.

```bash
# Activate venv, then:
python scripts/validation_audit_ui.py [--port 5111]
# Opens at http://localhost:<port>
```

The UI shows per-segment audio playback, transcription text, LID agreement, CTC scores, bucket classification, and split metadata. Audit data lives in `data/validation_audit/audit_output/`.

## Hummingbird Bucket Analysis (R2, 2026-03-11)

Full inventory of the `hummingbird` R2 bucket — raw podcast/audiobook corpus before transcription pipeline.

### Storage Summary

| Metric | Value |
|--------|-------|
| Total Objects | 854,366 |
| Total Storage | 4.17 TB |
| Average File Size | 5.1 MB |
| Total Audio Files | 347,889 (mp4 + mp3 + ts) |
| Total Audio Storage | ~4.17 TB |
| JSON Metadata Files | 348,493 |
| Subtitle (SRT) Files | 157,983 |

### File Types

| Extension | Files | Size | % of Storage |
|-----------|------:|-----:|:---:|
| .mp4 (audio/video) | 262,741 | 3.24 TB | 77.6% |
| .mp3 (audio) | 82,566 | 906 GB | 21.2% |
| .ts (transport stream) | 2,582 | 44.9 GB | 1.1% |
| .srt (subtitles) | 157,983 | 3.75 GB | 0.1% |
| .json (metadata) | 348,493 | 228 MB | <0.1% |
| .gz (compressed) | 1 | 8 MB | <0.1% |

### Estimated Audio Duration

| Metric | Value |
|--------|-------|
| Total Episodes | ~426,846 |
| Total Duration | ~107,500 hours (~4,480 days / ~12.3 years) |
| Average Episode Length | ~907 seconds (~15.1 min) |

Duration estimated by sampling 100 JSON metadata files per source prefix and extrapolating.
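
The extrapolation itself is a one-liner; a sketch under that sampling assumption:

```python
def estimate_total_hours(sampled_durations_s: list[float],
                         total_episodes: int) -> float:
    """Scale the sampled mean episode duration up to the full episode count."""
    mean_s = sum(sampled_durations_s) / len(sampled_durations_s)
    return mean_s * total_episodes / 3600.0
```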

### Language Breakdown

| Language | Files | Audio Files | Storage | Est Hours | % |
|----------|------:|------------:|--------:|----------:|--:|
| Hindi | 294,093 | ~147,046 | 1.20 TB | ~30,934 | 28.8% |
| Tamil | 143,571 | ~71,785 | 616 GB | ~15,514 | 14.4% |
| Malayalam | 83,128 | ~41,564 | 531 GB | ~13,364 | 12.4% |
| Marathi | 52,540 | ~26,270 | 434 GB | ~10,921 | 10.2% |
| Telugu | 90,020 | ~45,010 | 355 GB | ~8,932 | 8.3% |
| Bengali | 41,739 | ~20,869 | 286 GB | ~7,194 | 6.7% |
| Kannada | 56,231 | ~28,115 | 283 GB | ~7,134 | 6.6% |
| English | 57,398 | ~28,699 | 271 GB | ~6,833 | 6.4% |
| Gujarati | 20,278 | ~10,139 | 128 GB | ~3,225 | 3.0% |
| Assamese | 7,577 | ~3,788 | 74 GB | ~1,872 | 1.7% |
| Oriya | 2,305 | ~1,152 | 30 GB | ~755 | 0.7% |
| Bhojpuri | 994 | ~497 | 7 GB | ~168 | 0.2% |
| Spanish | 490 | ~245 | 3 GB | ~78 | 0.1% |
| German | 332 | ~166 | 2 GB | ~59 | 0.1% |
| French | 94 | ~47 | <1 GB | ~12 | <0.1% |

### Data Sources (7 prefixes)

| Prefix | Objects | Storage | Format | Source |
|--------|--------:|--------:|--------|--------|
| humming0_part1/ | 561,895 | 2.55 TB | mp4+json+srt | Kuku FM audiobooks (show/episode hierarchy) |
| humming0/ | 121,058 | 717 GB | mp4+json+srt | Kuku FM audiobooks (flat episode layout) |
| humming3/ | 123,384 | 674 GB | mp3+json | Pratilipi audiobooks |
| humming4/ | 31,429 | 122 GB | mp3+json | Podcast aggregator (Aawaz) |
| humming2/ | 11,028 | 108 GB | mp3+json | Podcast platform (Earshot/similar) |
| humming5/ | 4,896 | 37 GB | ts+json | PocketFM (transport streams) |
| samples/ | 675 | 2.24 GB | mp3+mp4 | Sample clips |

Plus 1 root file: `kuku_metadata.tar.gz` (8 MB master metadata archive).

### Metadata Schemas

Each audio file has a companion JSON with episode-level metadata. Key fields vary by source:

- **humming0/humming0_part1**: `episode_id`, `show_id`, `title`, `duration_s`, `language`, HLS URLs, `has_srt`
- **humming2**: `episode_id`, `series_id`, `series_title`, `title`, `duration_s`, `language`, `created_at`
- **humming3**: `episodeId`, `title`, `duration`, `audioUrl`, `language`, `playCount`, `fileSize`
- **humming4**: `episodeSlug`, `showSlug`, `title`, `durationSecs`, `audioUrl`, `language`, `publishedOn`
- **humming5**: `story_id`, `show_id`, `title`, `duration`, `language`, `source` (pocketfm), `create_time`

### SRT Subtitles

157,983 SRT files (3.75 GB) — primarily in `humming0/`, containing timestamped native-script subtitles. Standard SRT format with ms-precision timestamps. ~46% of episodes in the Kuku FM prefixes have matching SRTs.

## Deferred / Not Implemented

- **Music/jingle detection**: Proactive skip of non-speech segments before sending to Gemini. Currently these timeout and get discarded (40s max waste each).
- **OpenRouter fallback**: Code exists in `src/providers/openrouter.py` but is not used in production. OpenRouter doesn't support caching for Gemini 3 Flash (it enforces a 4096-token cache minimum; our prompt is 1036 tokens). Kept for preflight comparative tests only.

## Transcript Variant Prompt Validation (2026-03-11)

Added a separate text-only Gemini path for converting existing transcripts into two forms for downstream cleanup:
- `native_script_text`: same utterance rendered in the target native script
- `romanized_text`: same utterance rendered in plain ASCII Romanization

Rules implemented in `src/transcript_variant_prompt.py`:
- **Fully Roman input** is skipped locally before Gemini calls (`fully_roman_local_skip`)
- **Fully native** and **mixed native+Latin** inputs use the same cached Gemini prompt
- **Numerics, emails, URLs, handles, hashtags, and file-like tokens** are preserved exactly in both outputs
- Response format is enforced with Gemini structured output (`results[]` JSON object per input item)
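
A sketch of the script classification behind `fully_roman_local_skip`; the real classifier in `src/transcript_variant_prompt.py` may differ (this version uses Unicode character names as a heuristic, and omits the schema's `other` profile for letterless input):

```python
import unicodedata

def script_profile(text: str) -> str:
    """Classify a transcript by the scripts its letters use; fully Roman
    input can be skipped locally without a Gemini call."""
    has_latin = has_native = False
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, URLs etc. don't affect the profile
        if unicodedata.name(ch, "").startswith("LATIN"):
            has_latin = True
        else:
            has_native = True
    if has_native and has_latin:
        return "mixed_native_latin"
    if has_native:
        return "fully_native"
    return "fully_roman"  # skipped locally
```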

Live validation entrypoint: `scripts/test_transcript_variant_prompt.py`
- Prompt cache size: **1036 tokens** (just above Gemini Flash's 1024-token caching threshold)
- 100-item / 5-run validation on key 0: **85 Gemini items + 15 local skips, 98.82% determinism, 0 schema errors, 0 validation errors**
- Single non-deterministic item: Bengali romanization variance (`Aajker` vs `Ajker`) — acceptable
- Cross-key smoke on keys 1/2/3 (20 items / 2 runs each): **100% exact determinism on all three keys**

### Gemini Request Parameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Model | `gemini-3-flash-preview` | Cheapest 3-series, 1024-token cache threshold |
| Temperature | `0` | Max determinism for text normalization |
| Thinking level | `LOW` | No complex reasoning needed, minimizes latency |
| Response format | `application/json` + `responseJsonSchema` | Strict structured output |
| Explicit cache | 1036-token system prompt, 6-day TTL | Cached at 25% input cost |

**Input style**: Each request sends 10 transcript items as a JSON array in the user message. Each item has `id`, `language_code`, `input_script_profile`, and `text`. The cached system prompt contains all rules and examples.

**Output fields per item** (enforced by schema):
- `id` — echoed back from input
- `language_code` — echoed back
- `input_script_profile` — echoed back (`fully_native`, `mixed_native_latin`, or `other`)
- `native_script_text` — utterance in the target native script
- `romanized_text` — utterance in plain ASCII Roman letters

### Transcript Variant Fleet Pipeline (2026-03-11)

Production shard-worker pipeline for converting transcript text at scale.

**Architecture**: Shard-based claim queue in PostgreSQL, parquet input/output via R2, packed uploads.

```
PostgreSQL (transcript_variant_job_queue)  ──claim──>  Worker
                                                          │
                                                          ├── R2: download input shard parquet
                                                          ├── Local: classify scripts, skip fully_roman
                                                          ├── Gemini 3 Flash: batch 10 items/request
                                                          ├── Pack: buffer rows, flush parquet packs
                                                          ├── R2: upload output packs
                                                          ├── PostgreSQL: update progress + manifests
                                                          └── PostgreSQL: mark shard done
```

**New source files**:
```
src/
  variant_main.py         Entrypoint: python -m src.variant_main
  variant_worker.py       Worker lifecycle: claim, process, pack, upload, heartbeat
  variant_provider.py     Gemini client + cache manager for variant prompt
  variant_db.py           PostgreSQL tables + claim/heartbeat/manifest ops
  variant_r2.py           R2 upload/download for shard input + packed output

scripts/
  build_transcript_variant_test_input.py   Generate test parquet from samples
  init_transcript_variant_queue.py         Shard input parquet, upload to R2, seed queue
```

**PostgreSQL tables** (auto-created by worker on startup):
- `transcript_variant_job_queue` — one row per input shard, claim/progress/completion tracking
- `transcript_variant_workers` — live worker heartbeats, cumulative metrics
- `transcript_variant_pack_manifests` — one row per uploaded output pack

**Local validation** (verified):
- Seeded 4 test shards from 240-row sample parquet
- Worker claimed shard, processed 60 rows (51 Gemini + 9 skipped), uploaded 2 packs to R2
- DB metrics correct: `rows_processed=60, rows_gemini=51, rows_skipped=9, packs_uploaded=2, requests_succeeded=6, total_cache_hits=6`

**Docker validation** (verified):
- Same image, same queue, claimed next shard, identical behavior and metrics
- Entrypoint: `python -m src.variant_main`

**Production run completed (2026-03-11)**:

| Metric | Value |
|--------|-------|
| Input rows | 60,716,547 |
| Shards completed | 1,024 / 1,024 (100%) |
| Shards failed | 0 |
| Rows sent to Gemini | 43,168,424 (71.1%) |
| Rows skipped locally (fully Roman) | 17,548,123 (28.9%) |
| Gemini requests | ~3.5M (0.08% failure rate, 100% cache hits) |
| R2 output packs | 10,191 |
| R2 output size | 16.05 GB |
| Workers used | 85 (RTX 4090, US/Canada) |
| Peak throughput | ~19K RPM, 280K rows/min |
| Wall-clock time | ~3 hours |

**Deploy script**: `scripts/deploy_variant.sh`
```bash
# Deploy 25 workers across 4 Gemini keys (round-robin)
./scripts/deploy_variant.sh --concurrent 100 <offer_id1> <offer_id2> ...

# Dry run
./scripts/deploy_variant.sh --dry-run --concurrent 100 <offer_ids>
```

**Docker image**: `bharathkumar192/variant-worker:v2`

**Scaling notes**:
- Standard concurrency (20): ~200 RPM/worker
- Aggressive concurrency (100): ~2,000 RPM/worker (10x improvement)
- 4 Gemini keys at 20K RPM each = 80K RPM ceiling
- Staged deployment: 1 → 5 → 25 → 73 standard → +12 aggressive (total 85)

**R2 layout**:
```
s3://1-cleaned-data/transcript-variants/production-v1/input/shard_NNNNNN.parquet
s3://1-cleaned-data/transcript-variants/production-v1/output/shard_NNNNNN/shard_NNNNNN_pack_PPPP.parquet
```

**Init script bug fix**: `scripts/init_transcript_variant_queue.py` — DuckDB COPY with PARTITION_BY may produce multiple parquet files per shard directory. Fixed to merge them into a single file before upload.

Sample test data lives in `data/transcript_variant_prompt_samples.json`. Result snapshots:
- `final_data/transcript_variant_prompt_100x5.json`
- `final_data/transcript_variant_prompt_reduced_smoke.json`
