---
name: Validation Metrics Analysis
overview: Pull 7,470 parquet shards (54GB) from R2 to local, merge into a single analysis dataset, compute aggregate metrics, and bucket all segments into Golden/Redo/Dispose sets with summary statistics.
todos:
  - id: pull-shards
    content: Write and run scripts/pull_shards.py to download 7,470 parquet shards from R2 (drop vox_speaker_embedding to save space)
    status: in_progress
  - id: install-duckdb
    content: pip install duckdb in the venv
    status: in_progress
  - id: analyze-script
    content: Write scripts/analyze_validation.py with DuckDB queries for aggregate metrics + Golden/Redo/Dispose bucketing
    status: pending
  - id: run-analysis
    content: Run the analysis and report results + export CSV manifests
    status: pending
isProject: false
---

# Validation Metrics: Golden / Redo / Dispose Bucketing

## Data

- **7,470 parquet shards** in R2 bucket `validation-results` under `shards/<worker_id>/`
- **~54 GB** total, ~7.2 MB avg per shard, 27 columns per segment
- **503K videos**, estimated ~50M+ segments
- 369 GB free disk — plenty of room

## Approach

### Step 1: Download shards from R2

Download all 7,470 parquet shards to `/home/ubuntu/transcripts/data/validation_shards/` using parallel boto3 transfers. Drop `vox_speaker_embedding` (binary blobs, huge) during download to save ~30-40% space — not needed for metrics. Estimated download: ~30-35 GB after pruning, ~15-20 min at R2 speeds.

### Step 2: Build a single analysis view

Use DuckDB (fast columnar engine) to read all shards via a glob pattern and compute metrics in place. No need to merge into one giant file: DuckDB reads parquet partitions natively, so a single view over the glob acts as the analysis dataset.

```python
import duckdb

con = duckdb.connect()
# One view over every shard; DuckDB reads the parquet partitions natively,
# so no physical merge is required. Note: EXCLUDE errors on a missing column,
# so the clause is only needed if vox_speaker_embedding survived the
# download-time pruning in Step 1; drop it once the shards are pruned.
con.execute("""
    CREATE VIEW val AS
    SELECT * EXCLUDE (vox_speaker_embedding)
    FROM 'data/validation_shards/**/*.parquet'
""")
```

### Step 3: Compute aggregate metrics

Run SQL queries over the full dataset for:

- Total segments, total videos, total hours of audio
- Language distribution (by `consensus_lang`)
- LID consensus rate (2/3 vs 3/3 vs 0/3)
- MMS / VoxLingua confidence distributions
- Conformer CTC normalized distribution (histogram buckets)
- Gemini quality score distribution
- Per-language breakdown of all the above

### Step 4: Apply bucketing rules

**Golden** (production-ready, high-confidence):

- `lid_consensus = True` AND `lid_agree_count = 3`
- `conformer_multi_ctc_normalized >= 0.7` (OR NULL for English — no Conformer coverage)
- `gemini_quality_score >= 0.5`
- `duration_s >= 2.0`

**Redo** (salvageable, re-transcribe or review):

- `lid_consensus = True` AND `lid_agree_count >= 2`
- `conformer_multi_ctc_normalized` between 0.3 and 0.7 (or `gemini_quality_score` between 0.3 and 0.5)
- NOT in Golden, NOT in Dispose

**Dispose** (not salvageable):

- `lid_consensus = False` (all 3 models disagree)
- OR `conformer_multi_ctc_normalized < 0.3` (terrible CTC match)
- OR `duration_s < 1.0` (too short to be useful)

### Step 5: Output

Write a single analysis script `scripts/analyze_validation.py` that:

1. Reads all shards via DuckDB
2. Prints summary metrics to stdout
3. Exports 3 CSV manifest files: `data/golden_segments.csv`, `data/redo_segments.csv`, `data/dispose_segments.csv` with `video_id, segment_file, consensus_lang, scores...`
4. Exports a per-video summary: `data/video_summary.csv` — one row per video with golden/redo/dispose segment counts, dominant language, avg scores
5. Prints a final report: total hours per bucket, language breakdown, overall quality distribution

### Dependencies

- `duckdb` — add to venv (pip install, not in requirements.txt since this is analysis-only)
- Everything else (pyarrow, boto3) already installed

## File Changes

- **New**: `scripts/analyze_validation.py` — main analysis script
- **New**: `scripts/pull_shards.py` — parallel R2 download with column pruning
- **Modified**: Nothing in the existing codebase

