# Phase 2 Runbook: Full-Corpus 16kHz Mono Conversion + Training Artifacts

## Execution Plan (8-hour window)

### Key Facts
- **Only `final-export` needs conversion** (3551 shards, 48kHz → 16kHz)
- All other sources (587 shards) are already 16kHz — skip conversion
- 48kHz → 16kHz is an exact 3:1 integer decimation: low-pass filter below the new 8kHz Nyquist, then keep every third sample (no fractional resampler needed)
- Per-shard: ~3.3 GB in → ~1.1 GB out. Each shard FREES ~2.2 GB
- 128 CPUs available; 32 workers ≈ 3 hours estimated for conversion (the timeline budgets ~4 hours)
- In-place atomic swap: write .tmp, validate, swap, delete old
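
The 3:1 decimation can be sketched in pure Python as follows. This is an illustration under assumptions, not the pipeline's actual converter: the 63-tap Hamming-windowed sinc filter, the 7.2kHz cutoff (0.15 cycles per input sample), and the function names are all hypothetical choices, and a production run would more likely call an optimized library routine.

```python
import math
from typing import List, Sequence


def design_lowpass(num_taps: int = 63, cutoff: float = 0.15) -> List[float]:
    """Hamming-windowed sinc low-pass; cutoff is in cycles per input sample."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response, with the x == 0 limit handled explicitly.
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC


def decimate_by_3(samples: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Filter and keep every third sample, computing only the outputs that are kept."""
    half = len(taps) // 2
    out = []
    for center in range(0, len(samples), 3):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = center + k - half
            if 0 <= idx < len(samples):  # treat samples past the edges as zero
                acc += tap * samples[idx]
        out.append(acc)
    return out


def tone(freq_hz: float, n: int = 4800, rate: int = 48000) -> List[float]:
    """Unit-amplitude sine at the given frequency, sampled at 48kHz by default."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


TAPS = design_lowpass()
kept = decimate_by_3(tone(1000), TAPS)      # below 8kHz: passes through
aliased = decimate_by_3(tone(12000), TAPS)  # above 8kHz: suppressed, not folded back
```

Dropping samples without the filter would fold the 12kHz tone back to 4kHz in the 16kHz output, which is why the filter step cannot be skipped even though the ratio is an integer.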

### Timeline

| Window | Task | Duration |
|---|---|---|
| 0:00-0:15 | Inventory scan (all 4138 shards) | 15 min |
| 0:15-0:30 | Dry-run on 3 shards + worker sweep | 15 min |
| 0:30-4:30 | Full conversion (3551 shards, 32 workers) | ~4 hours |
| 4:30-5:00 | Conversion validation report | 30 min |
| 5:00-5:30 | Global manifest build (all 4138 shards) | 30 min |
| 5:30-6:00 | Train/dev/test split (video_id-safe) | 30 min |
| 6:00-6:30 | Decode-only benchmark | 30 min |
| 6:30-7:00 | Bucket calibration | 30 min |
| 7:00-8:00 | Validation report + cleanup | 60 min |
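
One way to make the train/dev/test split `video_id`-safe is to derive the split from a hash of the `video_id`, so that every clip from the same video lands in the same split. The sketch below assumes that reading of "video_id-safe"; the 1% dev and 1% test fractions, the clip dictionary layout, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


def assign_split(video_id: str, dev_pct: int = 1, test_pct: int = 1) -> str:
    """Map a video_id to a split deterministically, independent of shard order."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_clips(clips: Iterable[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group clips by split; clips sharing a video_id always share a split."""
    splits: Dict[str, List[Dict[str, str]]] = {"train": [], "dev": [], "test": []}
    for clip in clips:
        splits[assign_split(clip["video_id"])].append(clip)
    return splits


clips = [
    {"key": "a_0001", "video_id": "vid_a"},
    {"key": "a_0002", "video_id": "vid_a"},
    {"key": "b_0001", "video_id": "vid_b"},
]
by_split = split_clips(clips)
```

Because the assignment depends only on the `video_id`, rerunning the split after adding shards does not move existing videos between splits.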

### Thread Safety
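Pin every numeric library to one thread per process so that 32 workers do not oversubscribe the 128 CPUs. Export these before launching any worker:

```bash
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
# PyTorch documents OMP_NUM_THREADS/MKL_NUM_THREADS and torch.set_num_threads()
# for CPU threading; TORCH_NUM_THREADS may have no effect unless the pipeline reads it.
export TORCH_NUM_THREADS=1
```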
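
A preflight check along these lines could fail fast when a variable was not exported in the launching shell. It is a sketch: the function name is hypothetical, and the variable list simply mirrors the exports above.

```python
import os
from typing import List, Mapping

# Names taken from the export block in this runbook.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "TORCH_NUM_THREADS",
)


def unpinned_thread_vars(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the thread variables that are missing or not set to "1"."""
    return [name for name in THREAD_VARS if environ.get(name) != "1"]


# Example with an explicit mapping instead of the live environment.
missing = unpinned_thread_vars({"OMP_NUM_THREADS": "1", "MKL_NUM_THREADS": "4"})
```

Running the check in the parent process, before importing any numeric library, matters because most of these variables are read once at library load time.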

### Disk Safety
- Free space at start: ~1.2 TB
- Each converted shard frees ~2.2 GB (3.3 GB old → 1.1 GB new)
- After 100 shards: ~1.2 TB + 220 GB = ~1.4 TB free
- After all 3551: ~1.2 TB + 7.8 TB = ~9 TB free
- Critical: validate BEFORE deleting old tar
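
The free-space arithmetic above can be checked with a few lines. The sketch assumes decimal units (1.2 TB = 1200 GB) and adds one case the list does not cover: the temporary files of shards that are still being written, which take space before any old tar is deleted.

```python
FREE_AT_START_GB = 1200.0  # ~1.2 TB, decimal units assumed
OLD_SHARD_GB = 3.3         # 48kHz shard
NEW_SHARD_GB = 1.1         # 16kHz shard
TOTAL_SHARDS = 3551
WORKERS = 32


def free_after(converted: int, in_flight: int = 0) -> float:
    """Free space in GB after `converted` swaps, with `in_flight` .tmp files on disk."""
    freed = converted * (OLD_SHARD_GB - NEW_SHARD_GB)
    temp = in_flight * NEW_SHARD_GB
    return FREE_AT_START_GB + freed - temp


after_100 = free_after(100)                  # about 1420 GB
after_all = free_after(TOTAL_SHARDS)         # about 9012 GB
tightest = free_after(0, in_flight=WORKERS)  # about 1165 GB, before the first swap
```

The tightest point is the very start, when all 32 workers hold a `.tmp` file and nothing has been deleted yet; even then the run needs only about 35 GB of the ~1.2 TB available.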

### No Filtering
- ALL data is kept (no quality filtering at conversion stage)
- All bucket labels preserved in manifest
- Empty transcripts are kept with empty string (flagged, not dropped)
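
As an illustration of "flagged, not dropped", a manifest row might be built like this. The field names (`key`, `bucket`, `text`, `empty_transcript`) are hypothetical and not taken from the actual manifest schema.

```python
import json
from typing import Any, Dict, Optional


def manifest_entry(key: str, bucket: str, transcript: Optional[str]) -> Dict[str, Any]:
    """Build one manifest row; a missing transcript becomes "" and is flagged."""
    text = transcript or ""
    return {
        "key": key,
        "bucket": bucket,  # bucket label preserved as-is
        "text": text,
        "empty_transcript": text.strip() == "",
    }


rows = [
    manifest_entry("clip_0001", "clean", "hello world"),
    manifest_entry("clip_0002", "noisy", None),
]
lines = [json.dumps(row, ensure_ascii=False) for row in rows]  # one JSON object per line
```

Keeping the flag in the manifest lets later training stages decide whether to filter, without losing the row at conversion time.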

### Atomic Swap Protocol
```
1. Write audio_16k.tar.tmp alongside audio.tar
2. Validate: count match, all 16kHz, spot-check decode
3. Rename: audio.tar → audio_48k.bak
4. Rename: audio_16k.tar.tmp → audio.tar
5. Validate new audio.tar is readable
6. Delete: audio_48k.bak
7. Write: shard_conversion_status.json
```
If ANY step fails, the 48kHz original must survive: before step 3, leave audio.tar untouched; after step 3, rename audio_48k.bak back to audio.tar.
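
The protocol, including the rollback path, can be sketched as below. The `validate` callback stands in for the checks in steps 2 and 5 (count match, sample rate, spot-check decode) and is hypothetical, as are the function name and the contents of the status file. `os.replace` is used because a rename within one filesystem is atomic.

```python
import json
import os
from pathlib import Path
from typing import Callable


def swap_shard(shard_dir: Path, validate: Callable[[Path], bool]) -> bool:
    """Swap audio_16k.tar.tmp into place; return True only if the swap completed."""
    original = shard_dir / "audio.tar"
    converted = shard_dir / "audio_16k.tar.tmp"
    backup = shard_dir / "audio_48k.bak"
    status = shard_dir / "shard_conversion_status.json"

    # Step 2: validate the converted tar before touching the original.
    if not converted.exists() or not validate(converted):
        return False

    os.replace(original, backup)  # step 3
    try:
        os.replace(converted, original)  # step 4
        if not validate(original):  # step 5
            raise RuntimeError("post-swap validation failed")
    except Exception:
        # Rollback: restore the 48kHz original, overwriting any bad audio.tar.
        os.replace(backup, original)
        return False

    backup.unlink()  # step 6: the old data is deleted only after step 5 passed
    # Step 7: hypothetical status payload.
    status.write_text(json.dumps({"status": "converted", "sample_rate": 16000}))
    return True
```

Each rename is atomic on its own, but the pair is not: between steps 3 and 4 no `audio.tar` exists, so readers should not touch a shard while it is being converted.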
