I’d split this into two questions: did the fleet produce the right tokens, and can the trainer consume them safely?

The biggest risks in this run are not generic codec quality issues. They’re run-level data integrity issues: orphan or duplicate shards around crashes/retries, mixed worker cohorts accidentally using different code/checkpoints, long-segment truncation regressions, and manifest builders accidentally including non-`DONE` rows like your manual `SKIPPED_BALANCE`.

## What To Validate
- Freeze the run ledger first. Snapshot final `encoding_videos` counts and make sure nothing is left in `CLAIMED`, `DOWNLOADING`, `PROCESSING`, `ENCODED`, or `PACKED`. For training, whitelist only `status='DONE'`; do not rely on “not FAILED”.
- Reconcile `encoding_videos` ↔ `encoding_shards` ↔ R2 objects. Every `DONE` row should have exactly one `shard_id`. Every shard row should point to an existing R2 object, and the R2 object size should match `encoding_shards.size_bytes`. Every final `video_id` should appear once in the train manifest. Do not build the dataset from raw R2 listing alone, because retries can leave orphan or duplicate shard objects.
- Run a full structural scan of every shard. `codecbench/pipeline/shard_packer.py` already gives you the invariants: `manifest.json` plus per-segment NPZs with `xcodec2`, `bicodec_semantic`, `bicodec_semantic_len`, `bicodec_global`, and `_meta`. Validate every shard is readable, counts match the manifest, dtypes are integer, `bicodec_global` has 32 tokens, and token lengths are consistent with `duration_s * 50`.
- Specifically check for the old truncation bug. Any segment with `duration_s` greater than 6 s but only about 300 XCodec2 tokens is wrong. The stored `_meta.num_chunks` and token counts make this easy to detect.
- Do a stratified end-to-end re-encode audit on a sample of 500-1000 videos. Sample across language, duration, early/mid/late run, GPU family, and worker cohort. Re-download the source video, rerun extract/VAD/encode with the same 198K checkpoint, unpack the stored shard entry, and compare tokens exactly. This is the strongest proof the fleet really used the intended codepath, because the shard metadata does not currently store git SHA or checkpoint hash.
- For XCodec2, separate two checks: stored tokens vs a fresh fast re-encode should match exactly, while fast vs the original baseline will not, so check the latter only on a sample and compare it against the drift envelope you already established in `scripts/validate_fast_encoder.py`. BiCodec should remain exact.
- Decode and listen to a smaller sample. Use the same style of sanity checks as `scripts/run_eval.py`: determinism, mel/STFT/SI-SDR-style metrics, plus human listening for 50-100 clips. Oversample stitched long segments (`num_chunks > 1`) and known problematic tail-end videos.
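The reconciliation step above reduces to set comparisons once the three inventories are in memory. A minimal sketch, assuming the DB rows and a full R2 listing have been fetched separately (`r2_key` and the exact column names here are assumptions, not the real schema):

```python
from collections import Counter

def reconcile(done_rows, shard_rows, r2_objects):
    """Cross-check DONE rows against shard rows and an R2 listing.

    done_rows:  dicts with 'video_id' and 'shard_id' (status='DONE' only)
    shard_rows: dicts with 'shard_id', 'r2_key', 'size_bytes'
    r2_objects: mapping of r2_key -> object size from a full bucket listing
    Returns problem lists; the run reconciles iff all of them are empty.
    """
    problems = {"multi_shard_videos": [], "missing_objects": [],
                "size_mismatch": [], "orphan_objects": []}

    # Every DONE video must map to exactly one shard row.
    per_video = Counter(r["video_id"] for r in done_rows)
    problems["multi_shard_videos"] = sorted(v for v, n in per_video.items() if n != 1)

    # Every shard row must point at an existing object of the recorded size.
    expected_keys = set()
    for s in shard_rows:
        expected_keys.add(s["r2_key"])
        if s["r2_key"] not in r2_objects:
            problems["missing_objects"].append(s["r2_key"])
        elif r2_objects[s["r2_key"]] != s["size_bytes"]:
            problems["size_mismatch"].append(s["r2_key"])

    # Objects nothing points at are the orphans left behind by crashed retries.
    problems["orphan_objects"] = sorted(set(r2_objects) - expected_keys)
    return problems
```

This is why building from the R2 listing alone is unsafe: the orphan check only exists because the DB side is treated as the source of truth.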
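The structural scan and the truncation check collapse into one pure function per segment NPZ, which makes a full pass over the corpus easy to parallelize. A sketch under the invariants listed above (1-D integer token arrays, 50 XCodec2 tokens per second, 32 global tokens); the 10% length tolerance and the ~310-token truncation threshold are assumptions to tune, and the pickled `_meta` dict is deliberately not loaded here:

```python
import io
import numpy as np

REQUIRED = ("xcodec2", "bicodec_semantic", "bicodec_semantic_len", "bicodec_global")
FRAME_RATE = 50   # XCodec2 tokens per second, per the run's settings
TOL = 0.10        # assumed slack around duration_s * FRAME_RATE

def check_segment(npz_bytes, duration_s, num_chunks):
    """Return a list of problem strings for one stored segment NPZ."""
    issues = []
    arrs = np.load(io.BytesIO(npz_bytes), allow_pickle=False)
    for key in REQUIRED:
        if key not in arrs.files:
            issues.append(f"missing key {key}")
        elif not np.issubdtype(arrs[key].dtype, np.integer):
            issues.append(f"{key} is not an integer dtype")
    if "bicodec_global" in arrs.files and arrs["bicodec_global"].size != 32:
        issues.append("bicodec_global does not have 32 tokens")
    if "xcodec2" in arrs.files:
        n = int(arrs["xcodec2"].size)
        expected = duration_s * FRAME_RATE
        if abs(n - expected) > TOL * expected:
            issues.append(f"xcodec2 length {n} inconsistent with {duration_s}s")
        # The old bug: stitched long segments capped near one chunk of tokens.
        if duration_s > 6 and num_chunks > 1 and n <= 310:
            issues.append("likely truncation: long segment, ~one chunk of tokens")
    return issues
```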

## Training-Set Validity
- Build one canonical global manifest only from the passed `DONE` set. Include `shard_key`, `video_id`, `language`, segment path, `start_s`, `end_s`, `duration_s`, token lengths, and `num_chunks`.
- Keep a quarantine list for anything unreadable, duplicated, orphaned, or statistically weird.
- Run the actual training dataloader over the full manifest once before training. A dataset is not “valid” until the trainer can scan it end-to-end without parse, shape, or length failures.
- Lock down the dual-codec representation now. The stored data keeps codecs separate, which is good, but training should have one explicit rule for how BiCodec semantic/global tokens are flattened or modeled. Don’t let that be implicit in training code.
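A minimal sketch of the manifest/quarantine split, using the columns from the first bullet (the length-column names are assumptions; writing the surviving records out as `train_manifest.parquet` via pandas/pyarrow is left out):

```python
MANIFEST_COLUMNS = ("shard_key", "video_id", "language", "segment_path",
                    "start_s", "end_s", "duration_s",
                    "xcodec2_len", "bicodec_semantic_len", "num_chunks")

def build_manifest(done_rows, quarantine_ids):
    """Split passed DONE rows into train-manifest records and a quarantine list.

    quarantine_ids: video_ids already flagged by the integrity/reconciliation
    passes; rows missing any manifest column are quarantined too.
    """
    manifest, quarantined = [], []
    for row in done_rows:
        if row["video_id"] in quarantine_ids or \
                any(c not in row for c in MANIFEST_COLUMNS):
            quarantined.append(row["video_id"])
            continue
        manifest.append({c: row[c] for c in MANIFEST_COLUMNS})
    return manifest, quarantined
```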

## What I’d Do First
If you want the highest-confidence, lowest-effort first pass, do these in order:

1. Full DB/R2/shard reconciliation.
2. Full shard integrity scan.
3. Stratified exact re-encode sample.
4. Build the train manifest only from rows that passed steps 1-3.

The nice part is that your token corpus is only about 195 GB, so a full structural audit is very feasible. Reserve sampling for source-to-token fidelity, not for basic storage correctness.

The repo already has most of the primitives you need in `codecbench/pipeline/shard_packer.py`, `codecbench/pipeline/worker.py`, `scripts/validate_fast_encoder.py`, and `scripts/run_eval.py`. The missing piece is a run-level verifier that ties them together and emits a clean train manifest plus a quarantine report.

I can implement that next as a verifier script that does:
- shard/R2/DB reconciliation
- duplicate/orphan detection
- full NPZ integrity scan
- final `train_manifest.parquet`
- optional sampled re-encode checker
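For concreteness, a possible CLI skeleton for that verifier; every flag name here is a placeholder, not an existing interface:

```python
import argparse

def make_parser():
    """Argument skeleton for the proposed run-level verifier (names are hypothetical)."""
    p = argparse.ArgumentParser(prog="verify_run",
                                description="Reconcile DB/R2/shards, scan NPZs, "
                                            "emit train manifest + quarantine report.")
    p.add_argument("--db-url", required=True, help="encoding DB connection string")
    p.add_argument("--r2-prefix", required=True, help="R2 bucket/prefix holding shards")
    p.add_argument("--out", default="train_manifest.parquet")
    p.add_argument("--quarantine-out", default="quarantine_report.json")
    p.add_argument("--sample-reencode", type=int, default=0,
                   help="if >0, exactly re-encode this many stratified-sampled videos")
    return p
```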