---
name: Validation Backfill Recovery
overview: Fix the split-packaging bug at the transcription-to-validation handoff, repair only the affected transcribed tar files, and run a targeted validation backfill that replaces affected videos at whole-video granularity instead of revalidating the full corpus.
todos:
  - id: baseline-counts
    content: Capture exact DB/R2 baseline counts and define per-video acceptance metrics for repair and validation coverage.
    status: pending
  - id: fix-packaging
    content: Fix split-safe tar packaging in src/pipeline.py and src/r2_client.py, and add regression tests for multi-split segments.
    status: pending
  - id: repair-historical-tars
    content: Build a repair job that regenerates split audio from raw tars and repacks corrected transcribed tars using existing transcription_results rows.
    status: pending
  - id: build-manifest
    content: Generate the affected-video manifest from split rows, missing/mismatched tar contents, and historical validation failures.
    status: pending
  - id: targeted-backfill
    content: Add targeted validation-run filtering plus run_id/provenance so workers validate only the affected manifest and write isolated shards.
    status: pending
  - id: merge-and-verify
    content: Merge old and new validation results by video_id, then verify final coverage against transcription_results before updating statuses.
    status: pending
isProject: false
---

# Validation Backfill Recovery

## Recommended Path

Use a targeted backfill, not a full 500k rerun.

The best path is:

- Fix the tar-packaging bug for future outputs.
- Repair historical `_transcribed.tar` files for affected videos without rerunning Gemini.
- Run a targeted validation backfill only on repaired, historically failed, or missing videos.
- Merge by `video_id` at whole-video granularity so historical good validation outputs stay intact.

## Why This Is The Best Path

- The root bug is the packaging handoff, not the transcription text itself.
- Validation is video-atomic today, so segment-level skipping is not worth building first.
- Historical validation already covered many unaffected videos; the cheapest safe recovery is whole-video replacement for affected videos only.
- Revalidating all `503k+` transcribed videos would work, but it wastes compute and makes provenance/merge harder than necessary.

## Key Evidence In Code

- Split IDs are created in [src/pipeline.py](src/pipeline.py): `seg_id = f"{original_file}_split{split_index}"`.
- Packing falls back to original raw filenames in [src/pipeline.py](src/pipeline.py): `valid_paths = [path_index[s.trim_meta.original_file] ...]`.
- JSON naming collapses split outputs in [src/r2_client.py](src/r2_client.py): `json_name = Path(seg_name).stem + ".json"`.
- Validation reads tar members by FLAC/JSON stem in [validations/audio_loader.py](validations/audio_loader.py), so collapsed tar members directly reduce validation coverage.

```119:123:src/pipeline.py
seg_id = seg.trim_meta.original_file
if seg.trim_meta.was_split:
    seg_id = f"{seg.trim_meta.original_file}_split{seg.trim_meta.split_index}"
```

```192:194:src/pipeline.py
path_index = {p.name: p for p in extracted.segment_paths}
valid_paths = [path_index[s.trim_meta.original_file]
               for s in valid_segments if s.trim_meta.original_file in path_index]
```

```118:120:src/r2_client.py
for seg_name, result_json in transcription_jsons.items():
    json_name = Path(seg_name).stem + ".json"
```

## Execution Plan

### 1. Freeze A Baseline And Define Success Metrics

- Record exact current counts from DB and R2 before changes.
- Define acceptance criteria per affected video:
  - repaired tar `segments/*.flac` count equals expected split-safe segment count
  - repaired tar `transcriptions/*.json` count equals DB transcription row count for that `video_id`
  - post-backfill validation rows for that `video_id` match repaired tar-visible segments
- Treat coverage as complete when merged validation results cover all transcribed rows except explicit irrecoverable failures.
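The per-video acceptance criteria above can be sketched as a small check object. This is a minimal illustration; the field sources (repair job, `transcription_results`, tar listing, validation rows) are as described above, and the class name and layout are assumptions, not existing code.

```python
from dataclasses import dataclass


@dataclass
class VideoAcceptance:
    """Hypothetical per-video acceptance record for the repair + backfill."""
    video_id: str
    expected_segments: int       # split-safe segment count from the repair job
    db_transcription_rows: int   # transcription_results rows for this video_id
    tar_flac_count: int          # segments/*.flac members in the repaired tar
    tar_json_count: int          # transcriptions/*.json members in the repaired tar
    validated_segments: int      # post-backfill validation rows for this video_id

    def passes(self) -> bool:
        # All three criteria from the plan must hold for the video to count
        # as fully recovered.
        return (
            self.tar_flac_count == self.expected_segments
            and self.tar_json_count == self.db_transcription_rows
            and self.validated_segments == self.tar_flac_count
        )
```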

### 2. Fix Packaging For New Outputs

Change [src/pipeline.py](src/pipeline.py) and [src/r2_client.py](src/r2_client.py) so tar packaging preserves split identity.

- Pack from polished split-safe audio bytes, not original extracted raw paths.
- Use one canonical basename for both audio and JSON, shared across DB `segment_file`, tar FLAC name, and tar JSON name.
- Keep tar layout unchanged: `{video_id}/segments/*.flac`, `{video_id}/transcriptions/*.json`, `metadata.json`.
- Add regression tests around split outputs in [tests](tests) so one original segment producing multiple splits yields multiple FLACs and JSONs in the tar.

### 3. Build A Tar-Repair Job For Historical Affected Videos

Create a repair flow that does not rerun Gemini.

- Input: raw tar from `1-cleaned-data`, existing `transcription_results` rows for that `video_id`.
- Re-run only the `audio_polish` step to deterministically regenerate split audio.
- Map regenerated segment IDs to existing DB transcriptions.
- Rebuild and upload corrected `_transcribed.tar` with split-safe FLAC and JSON names.
- Fail fast if regenerated segment IDs do not match DB transcription rows for that video.
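The fail-fast reconciliation in the last bullet can be isolated into one function the repair entrypoint calls before repacking. The exception name and signature are assumptions for illustration.

```python
class RepairMismatch(Exception):
    """Regenerated split IDs diverge from DB transcription rows."""


def reconcile_ids(regenerated_ids: set[str],
                  db_segment_files: set[str]) -> set[str]:
    """Fail fast unless regenerated split audio exactly matches DB rows.

    A mismatch means the deterministic re-split did not reproduce the
    original split layout, so repacking would mislabel transcriptions.
    """
    missing = db_segment_files - regenerated_ids
    extra = regenerated_ids - db_segment_files
    if missing or extra:
        raise RepairMismatch(
            f"missing={sorted(missing)} extra={sorted(extra)}"
        )
    return regenerated_ids
```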

Likely files:

- [src/audio_polish.py](src/audio_polish.py)
- [src/pipeline.py](src/pipeline.py)
- [src/r2_client.py](src/r2_client.py)
- a new repair entrypoint under [scripts](scripts) or [src](src)

### 4. Generate An Affected-Video Manifest

Target only videos that need repair or revalidation.

- Include all videos with `_split` rows in `transcription_results`.
- Include videos with missing or mismatched `_transcribed.tar` contents.
- Include historical `validation_failed` videos.
- Include videos with transcribed DB rows but missing tar-visible members.
- Store the manifest as a deterministic `video_id` list plus summary counts.
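Assuming each source above yields a set of `video_id`s, the manifest is their deterministic union plus summary counts. A minimal sketch (function and key names are hypothetical):

```python
def build_manifest(split_rows: set[str],
                   tar_mismatches: set[str],
                   validation_failed: set[str],
                   missing_members: set[str]) -> tuple[list[str], dict]:
    """Union the four affected-video sources into a sorted, stable manifest."""
    affected = sorted(
        split_rows | tar_mismatches | validation_failed | missing_members
    )
    summary = {
        "split_rows": len(split_rows),
        "tar_mismatches": len(tar_mismatches),
        "validation_failed": len(validation_failed),
        "missing_members": len(missing_members),
        "total_affected": len(affected),
    }
    return affected, summary
```

Sorting makes repeated manifest generation reproducible, so reruns of the backfill target the same list.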

### 5. Add Targeted Validation Backfill Mode

Do not rely on resetting videos to the global `validation_status='pending'` state: the current claim logic is global and video-atomic, so it would sweep unaffected videos into the run alongside the manifest.

- Add a targeted claim/filter path to [validations/main.py](validations/main.py), [validations/config.py](validations/config.py), and [validations/worker.py](validations/worker.py).
- Feed workers only the manifest videos.
- Keep outputs isolated with a `run_id` and new shard prefix / worker namespace.
- Stamp provenance into parquet via [validations/packer.py](validations/packer.py).

### 6. Preserve Historical Good Validation Results

Merge at whole-video granularity, not segment granularity.

- Historical good videos remain untouched.
- For each affected `video_id`, replace old validation output with the new backfill output.
- Keep historical and backfill shards separable until counts reconcile.
- Update analysis scripts like [scripts/pull_shards.py](scripts/pull_shards.py) and [scripts/analyze_validation.py](scripts/analyze_validation.py) so they can filter by `run_id` / shard prefix and perform whole-video replacement during final analysis.
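Whole-video replacement reduces to a keyed overwrite: backfill results win per `video_id`, and historical videos not in the backfill stay intact. A minimal sketch, assuming results are grouped by video:

```python
def merge_by_video(historical: dict[str, list],
                   backfill: dict[str, list]) -> dict[str, list]:
    """Whole-video replacement: backfill wins per video_id.

    Historical videos absent from the backfill are untouched; affected
    videos are replaced in full, never merged segment-by-segment.
    """
    merged = dict(historical)
    merged.update(backfill)
    return merged
```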

### 7. Verify Final Coverage Before Any Broad Status Updates

After backfill:

- Compare merged validated segment count against `transcription_results` count.
- Reconcile any remaining gap into explicit buckets:
  - irrecoverable/missing raw tar
  - transcription row without reconstructable split audio
  - true validation runtime failures
- Only then decide whether to restore/update DB `validation_status` fields for the recovered videos.
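The gap reconciliation above can be made exhaustive: every transcribed-but-unvalidated segment must land in exactly one bucket before any status update. A sketch with hypothetical inputs (sets of segment IDs classified upstream):

```python
def reconcile_gap(transcribed: set[str], validated: set[str],
                  missing_raw_tar: set[str],
                  no_split_audio: set[str]) -> dict[str, set[str]]:
    """Partition the coverage gap into the three explicit buckets."""
    gap = transcribed - validated
    buckets = {
        "irrecoverable_missing_raw": gap & missing_raw_tar,
        "no_reconstructable_split": (gap & no_split_audio) - missing_raw_tar,
        "runtime_failures": gap - missing_raw_tar - no_split_audio,
    }
    # The buckets must cover the whole gap; anything left over means an
    # unclassified failure mode and blocks status updates.
    assert set().union(*buckets.values()) == gap
    return buckets
```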

## Recommended Rollout Order

```mermaid
flowchart TD
    baseline[BaselineCounts] --> fixPack[FixPackaging]
    fixPack --> repairJob[RepairHistoricalTranscribedTars]
    repairJob --> manifest[AffectedVideoManifest]
    manifest --> targetedVal[TargetedValidationBackfill]
    targetedVal --> merge[WholeVideoMerge]
    merge --> verify[CoverageVerification]
    verify --> finalize[FinalizeStatusesAndReporting]
```

## Success Criteria

- New `_transcribed.tar` files preserve all split outputs.
- Targeted backfill validates every repaired transcribed segment for affected videos.
- Merged validation coverage reaches all transcribed segments except explicitly documented irrecoverable failures.
- Historical good validation outputs are preserved for unaffected videos.

