# CodecBench

Benchmark + validation pipeline for neural audio codecs.

## Supported Codecs

| Codec | SR | TPS | Token Structure | Package |
|-------|---:|----:|-----------------|---------|
| XCodec2 | 16k | ~50 | Single VQ [B, T] | `pip install xcodec2` |
| XCodec2 Fast | 16k | ~50 | Single VQ [B, T] | (optimized wrapper) |
| BiCodec (Spark-TTS) | 16k | ~50 | Semantic + Global dict | Spark-TTS repo |
| BiCodec Fast | 16k | ~50 | Semantic + Global dict | (optimized wrapper) |

## Install

```bash
python -m venv venv && source venv/bin/activate
pip install -e ".[all]"
```

## Project Structure

```
codecbench/
  codecs/
    base.py              # TokenBatch + NeuralCodec protocol
    xcodec2.py           # Original XCodec2 wrapper
    xcodec2_fast.py      # Optimized: GPU mel, 16L truncation, SDPA, TF32
    bicodec.py           # Original BiCodec wrapper
    bicodec_fast.py      # Optimized: 17L truncation, SDPA, TF32
  audio/
    io.py                # Load, resample, normalize
    batching.py          # Fixed-window cropping, padding
  bench/
    timer.py             # CUDA event timing + stats
    runner.py            # Benchmark orchestration
  serialize/
    tokens.py            # NPZ/zstd token storage
  pipeline/
    config.py            # All pipeline config (R2, Supabase, VAD, codec, worker)
    r2_client.py         # R2 download/upload + async video prefetcher
    supabase_client.py   # Supabase orchestration (tables, claims, heartbeats)
    vad.py               # Silero-VAD segmentation (2-30s segments)
    encoder.py           # Hot GPU encoder (XCodec2 + BiCodec parallel streams)
    shard_packer.py      # Token packing → tar.zst shards
    heartbeat.py         # Worker heartbeat thread
    worker.py            # Main pipeline orchestrator
    cli.py               # CLI entry point (ingest, run, stats, bench)
scripts/
  run_bench.py           # Throughput benchmark CLI
  run_eval.py            # Quality evaluation CLI
  smoke_test.py          # Quick sanity checks
  validate_fast_encoder.py  # XCodec2 600-sample validation
  download_and_bench.py  # Download + segment + benchmark
repos/
  Spark-TTS/             # Spark-TTS source (for BiCodec modules)
  Spark-TTS-0.5B/        # BiCodec + wav2vec2-xlsr-53 checkpoints
```

## Production Pipeline

### Dataset

| Source | Bucket | Videos | Hours | Languages |
|--------|--------|-------:|------:|-----------|
| English pretrain | pt-english | 402,479 | 160,507 | english |
| Indic pretrain | pt-indic | 2,916,470 | 1,287,153 | hindi, telugu, tamil, marathi, punjabi, malayalam, kannada, gujarati, bengali, assamese, odia |
| **Total** | | **3,318,949** | **1,447,660** | **12 languages** |

### Pipeline Architecture

```
R2 (video) → ffmpeg → VAD → GPU encode (XCodec2 + BiCodec) → shard pack → R2 (tokens)
     ↑                                                              ↓
  Supabase ←←←←←←←←← heartbeat + status updates ←←←←←←←←←←←←←←←←←
```

**Per-video lifecycle**: `PENDING → CLAIMED → DOWNLOADING → PROCESSING → ENCODED → PACKED → DONE`

### Quick Start

```bash
# 1. Setup Supabase tables + ingest video metadata
python -m codecbench.pipeline.cli setup-db
python -m codecbench.pipeline.cli ingest metafiles/english_pretrain.csv metafiles/indic_pretrain.csv

# 2. Run worker (processes videos until stopped)
python -m codecbench.pipeline.cli run --language=english --shard-count=50

# 3. Check progress
python -m codecbench.pipeline.cli stats

# 4. GPU benchmark (find optimal batch size + parallelism for this GPU)
python -m codecbench.pipeline.cli bench
```

### CLI Options

```bash
python -m codecbench.pipeline.cli run \
  --language=hindi          # Filter by language (None=any)
  --max-videos=100          # Stop after N videos
  --batch-size=2            # XCodec2 batch size
  --no-parallel             # Disable dual-stream encoding
  --shard-count=50          # Videos per shard pack
  --prefetch=2              # Pre-download N videos
  --offer-id=vast_12345     # Vast.ai instance ID
  --custom-ckpt=/path.ckpt  # Custom XCodec2 checkpoint
  --tmp-dir=/tmp/pipeline   # Local temp storage
```

### Docker

Image: `bharathkumar192/codecbench-pipeline:latest` (18.9GB, all models baked in)

```bash
# Pull from Docker Hub
docker pull bharathkumar192/codecbench-pipeline:latest

# Or build locally
docker build --build-arg HF_TOKEN=$HF_TOKEN -t codecbench-pipeline .

# Run pipeline worker with GPU
docker run --gpus all --env-file .env \
  bharathkumar192/codecbench-pipeline:latest run --language=english --shard-count=50

# Standalone benchmark (no Supabase needed, lists videos from R2)
docker run --gpus all --env-file .env --entrypoint python3 \
  bharathkumar192/codecbench-pipeline:latest \
  scripts/gpu_benchmark.py --standalone --num-videos=5 --batch-size=2

# Or with docker-compose
docker compose up -d
```

### Shard Format

Each shard is a zstd-compressed tar containing:
- `manifest.json`: metadata (video_ids, languages, segment count, codec info)
- `segments/<video_id>/<idx>.npz`: per-segment tokens (uint16 xcodec2 + bicodec_semantic + bicodec_global)
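Reading a shard back can be sketched with the stdlib. This assumes the caller handles zstd decompression first (e.g. via the `zstandard` package); NPZ payloads are returned as raw bytes that `np.load(io.BytesIO(...))` would decode:

```python
import io
import json
import tarfile

def read_shard(tar_bytes: bytes):
    """Parse a (decompressed) shard tar: manifest.json + segments/<vid>/<idx>.npz.

    Sketch only -- zstd decompression is left to the caller, and NPZ members
    are returned as raw bytes (np.load(io.BytesIO(b)) decodes them).
    """
    manifest, segments = None, {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            if member.name.endswith("manifest.json"):
                manifest = json.loads(data)
            elif member.name.endswith(".npz"):
                segments[member.name] = data
    return manifest, segments
```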

### Supabase Tables

| Table | Purpose |
|-------|---------|
| `encoding_videos` | Video registry + lifecycle (3.3M rows, indexed by status/language) |
| `encoding_workers` | Worker registration, heartbeats, RTF metrics |
| `encoding_shards` | Shard tracking (video_ids, R2 keys, sizes) |

### Worker Features

- **Async prefetch**: downloads next videos while GPU encodes current one
- **Atomic claims**: `FOR UPDATE SKIP LOCKED` prevents duplicate work across workers
- **Heartbeats**: 30s interval, stale claims auto-released after 10min timeout
- **Failure recovery**: on restart, releases own stale claims, starts fresh
- **Language meta-tags**: workers process any language; language codes are tagged per video for later grouping
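The atomic claim is a single SQL statement; a sketch of what the `claim_next_video()` RPC might run (column names `claimed_by`, `claimed_at`, and `r2_key` are illustrative assumptions, not the actual schema):

```python
# Illustrative sketch of the claim query -- the real RPC lives in Supabase
# (see supabase_client.py) and its column names may differ.
CLAIM_SQL = """
UPDATE encoding_videos
SET status = 'CLAIMED', claimed_by = %(worker_id)s, claimed_at = now()
WHERE id = (
    SELECT id FROM encoding_videos
    WHERE status = 'PENDING'
      AND (%(language)s IS NULL OR language = %(language)s)
    FOR UPDATE SKIP LOCKED   -- concurrent workers skip rows already being claimed
    LIMIT 1
)
RETURNING id, r2_key;
"""
```

`SKIP LOCKED` makes the row selection non-blocking: two workers racing for the same `PENDING` row never wait on each other, and never claim the same video.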

### GPU Benchmarking

Detailed per-stage timing with VRAM/CPU/idle metrics:

```bash
# Standalone benchmark (no Supabase, lists from R2 directly)
python scripts/gpu_benchmark.py --standalone --num-videos=5 --batch-size=2

# Or via Supabase claims (needs PENDING videos)
python scripts/gpu_benchmark.py --num-videos=10 --batch-size=2

# Multi-GPU benchmark via Vast.ai (uses Docker image, no onstart needed)
python scripts/vastai_benchmark.py --api-key=<KEY> --gpus=RTX_4090,A100_SXM4 --num-videos=10
python scripts/vastai_benchmark.py --api-key=<KEY> --list-gpus  # show available
python scripts/vastai_benchmark.py --api-key=<KEY> --gpus=RTX_4090 --sync-local  # test local code changes
```

### Cross-GPU Benchmark Results (Async Pipeline, Vast.ai)

All benchmarks: 5 real videos from R2 (1.22 hrs audio), BS=2, 8 extract workers, chunked VAD.

| GPU | VRAM | CPU | $/hr | Pipeline RTF | Enc RTF | 1hr audio | VRAM peak | 1.45M hrs @100 GPUs |
|-----|------|-----|------|-------------|---------|-----------|-----------|---------------------|
| **L40S** | 48GB | 32c | $0.35 | **227x** | 379x | **16s** | 4947 MB | **2.7 days** |
| **RTX 4090** | 24GB | 24c | $0.27 | **211x** | 399x | **17s** | 4913 MB | **2.9 days** |
| RTX 4080S | 32GB | 24c | $0.18 | 141x | 274x | 26s | 4936 MB | 4.3 days |
| A100 80GB | 80GB | 16c | ~$1.00 | 135x | 420x | 27s | 4895 MB | 4.5 days |

RTX 5090 (Blackwell sm_120) needs torch 2.6+ -- incompatible with current image (torch 2.5.0).

**Recommended config for production:**
- **Best perf/$**: RTX 4090 ($0.14-0.27/hr) -- 211x RTF, full dataset in 2.9 days on 100 GPUs for ~$1,900
- **Best throughput**: L40S ($0.35/hr) -- 227x RTF, 32 CPU cores help async pipeline
- **Batch size**: BS=2 (optimal for all GPUs, VRAM always ~5GB)
- **Extract workers**: 8 (scale to `cores//2` for larger machines)
- **Chunked VAD**: auto for videos >300s (300s chunks, 2s overlap)
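The chunked-VAD windowing above can be sketched as follows (a minimal sketch; the real edge handling in `vad.py` may differ):

```python
def vad_chunks(duration_s: float, chunk_s: float = 300.0, overlap_s: float = 2.0):
    """Split long audio into fixed windows for VAD: 300s chunks, 2s overlap.

    Videos at or under one chunk length are processed whole.
    Returns a list of (start_s, end_s) windows covering the full duration.
    """
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]
    stride = chunk_s - overlap_s
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += stride
    return chunks
```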

**Full dataset estimate: 1.45M hours on 100x RTX 4090 = ~3 days, ~$1,900**

### Pipeline Stage Breakdown (legacy serial baseline)

**A100 80GB PCIe (3 real videos, BS=2, parallel)**

| Stage | Total (s) | Avg/video | % of total |
|-------|----------|-----------|-----------|
| Download | 6.6 | 2.2 | 3.5% |
| Extract (ffmpeg) | 119.0 | 39.7 | **63.6%** |
| VAD (Silero) | 53.5 | 17.8 | 28.6% |
| Encode (GPU) | 8.0 | 2.7 | **4.3%** |
| Pack + Upload | 0.0 | 0.0 | 0.0% |

Encode-only RTF: **477x**, Overall RTF: **19.1x**, GPU idle: **95.7%**, VRAM peak: 4895 MB, 1hr audio in 188s

**RTX 3060 12GB (5 real videos, BS=2, parallel)**

| Stage | Total (s) | Avg/video | % of total |
|-------|----------|-----------|-----------|
| Download | 5.3 | 1.1 | 2.4% |
| Extract (ffmpeg) | 146.7 | 29.3 | **65.2%** |
| VAD (Silero) | 49.5 | 9.9 | 22.0% |
| Encode (GPU) | 22.5 | 4.5 | **10.0%** |
| Pack + Upload | 1.0 | 0.2 | 0.4% |

Encode-only RTF: **137x**, Overall RTF: **13.7x**, GPU idle: **90%**, VRAM peak: 4982 MB

**A100 Async Pipeline (3 videos, BS=2, 6 extract workers)**

| Metric | Serial | Async | Improvement |
|--------|--------|-------|-------------|
| Wall clock (3 videos) | 187s | **50.9s** | **3.7x** |
| Pipeline RTF | 19.1x | **71.8x** | **3.8x** |
| Extract avg/video | 39.7s | **3.8s** | **10.4x** |
| 1hr audio in | 188s | **50s** | **3.8x** |

Optimizations: replaced 2-pass loudnorm with single-pass peak normalization, ffmpeg pipe extraction (no disk I/O), 6 parallel extract+VAD workers, per-thread VAD models, in-memory segment_tensor().

## Final Benchmark Results (RTX 3060 12GB)

All numbers on 48 real speech segments, 6s chunks @ 16kHz. No FP16 autocast.

### Individual Codec Performance

| Codec | Original | Fast | Speedup | RTF |
|-------|----------|------|---------|-----|
| XCodec2 (B=1) | 345.8 ms | 152.5 ms | 2.27x | 39.4x |
| BiCodec (B=1) | 65.3 ms | 44.4 ms | 1.47x | 135.2x |

### Dual-Codec Pipeline (FastXCodec2 + FastBiCodec, parallel threads+streams)

| XCodec BS | Sequential | Parallel | Gain | RTF | 1hr wall | VRAM |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 195.0ms | 177.1ms | +9.2% | 33.9x | 106.3s | 5,997 MB |
| **2** | **188.5ms** | **167.9ms** | **+10.9%** | **35.7x** | **100.7s** | **6,123 MB** |
| 4 | 182.7ms | 174.4ms | +4.5% | 34.4x | 104.7s | 6,401 MB |
| 8 | 180.9ms | 174.8ms | +3.4% | 34.3x | 104.9s | 7,074 MB |

**Best config**: XCodec B=2 + BiCodec B=1, parallel threads+streams: 167.9 ms per segment
to encode both codecs (35.7x real-time), i.e. 1 hour of audio through both 16kHz codecs
in ~101 seconds.

**Overall speedup**: 411 ms (original sequential) → 168 ms (optimized parallel) = **2.45x**
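The threads+streams scheme can be sketched with plain threads; in the real encoder each callable wraps its model call in a dedicated `torch.cuda.Stream` so the two encoders overlap on one GPU. The encoder callables here are placeholders (an assumption):

```python
import threading

def encode_parallel(encode_xcodec2, encode_bicodec, batch):
    """Run both codec encoders concurrently on the same batch.

    Sketch only: the callables stand in for the real model calls, which
    release the GIL during CUDA work and each run on their own CUDA stream
    (e.g. `with torch.cuda.stream(stream): ...`).
    """
    results = {}

    def run(name, fn):
        results[name] = fn(batch)

    threads = [
        threading.Thread(target=run, args=("xcodec2", encode_xcodec2)),
        threading.Thread(target=run, args=("bicodec", encode_bicodec)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```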

### Token Fidelity

| Codec | Match vs Original | Source of Drift |
|-------|---:|---|
| XCodec2 semantic | 99.47% (600 samples) | GPU mel: torch float32 FFT vs numpy float64 FFT |
| BiCodec semantic | **100.00%** (50 samples) | Zero drift (truncation only) |
| BiCodec global | **100.00%** (50 samples) | Zero drift |

XCodec2 drift is deterministic (the same token positions differ on every run) and is
caused by cuFFT vs pocketfft implementation differences. It is not from FP16 (removed),
not from truncation (proven lossless), and not from SDPA (same math).

BiCodec is fully lossless — truncation from 24→17 layers removes computation that
was immediately discarded. No GPU mel replacement needed (CPU feature extractor is
only 0.6ms).

### XCodec2 Fast: What Changed

| Optimization | Effect | Lossless? |
|---|---|---|
| GPU mel extraction | 351ms → 1ms (350x faster) | 99.47% (float32 vs float64 FFT) |
| Layer truncation (24→16) | 83ms → ~55ms wav2vec2 | 100% proven |
| SDPA attention patch | Marginal speedup | Same math |
| TF32 matmuls | ~5% speedup | Negligible drift |
| Drop attention mask | Marginal | Lossless for fixed lengths |
| Batch fix | Enables B>1 | Lossless |

### BiCodec Fast: What Changed

| Optimization | Effect | Lossless? |
|---|---|---|
| Layer truncation (24→17) | 44ms → ~31ms wav2vec2 | **100% proven** |
| SDPA attention patch | Marginal speedup | Same math |
| TF32 matmuls | ~5% speedup | Negligible drift |
| Tensor-in encode | No file path/numpy roundtrip | Lossless |

### What Was Tested and Rejected

| Optimization | Result | Why Rejected |
|---|---|---|
| FP16 autocast (XCodec2) | 0.15% extra drift, 10.6ms saved | Compounds at scale |
| FP16 autocast (BiCodec) | 1.26% semantic drift | Unacceptable at 1M hours |
| FP64 GPU mel | 0.05% improvement, shifts different codes | Not convergent to numpy |
| weight_norm removal | 0 modules affected | Checkpoint has baked weights |
| flatten_parameters | 0 RNN modules found | No LSTMs in active model |
| cudnn.benchmark | No measurable change | Fixed-shape inputs |
| cuBLASLt | Worse code match | Different BLAS rounding |

### VRAM Utilization

| GPU | Both Models | Headroom | % Used |
|-----|---:|---:|---:|
| RTX 3060 12GB | 6,123 MB | 6,165 MB | 50% |
| RTX 4090 24GB | 6,123 MB | 18,453 MB | 25% |
| A100 40GB | 6,123 MB | 34,717 MB | 15% |
| A100 80GB | 6,123 MB | 75,797 MB | 7% |
| H100 80GB | 6,123 MB | 75,797 MB | 7% |

---

## Project History & Status

### Phase 1: Codec Optimization (Complete)

Built two optimized audio codec encoders for a dual-codec parallel pipeline on a single GPU. Both encode 16kHz speech audio into discrete VQ tokens for downstream LM pretraining.

**XCodec2 Fast** (`codecbench/codecs/xcodec2_fast.py`, 376 lines)
- Source: `pip install xcodec2` (HKUSTAudio/xcodec2), wrapper is ours
- Input: 16kHz audio, Output: single VQ codebook, ~301 tokens per 6s
- Architecture: GPU mel -> wav2vec2-bert (16/24 layers) -> SemanticEncoder -> CodecEnc -> VQ
- Optimizations: GPU mel extraction (351ms->1ms), layer truncation, SDPA, TF32, batch fix
- 99.47% code match vs original (drift from cuFFT vs pocketfft, deterministic)

**BiCodec Fast** (`codecbench/codecs/bicodec_fast.py`, 293 lines)
- Source: SparkAudio/Spark-TTS-0.5B + Spark-TTS repo
- Input: 16kHz audio, Output: semantic tokens (~299) + global speaker tokens (32)
- Architecture: Wav2Vec2FeatureExtractor -> wav2vec2-xlsr-53 (17/24 layers) -> BiCodec encoder -> VQ
- Optimizations: layer truncation (24->17), TF32, tensor-in encode
- 100.00% token match (fully lossless)

**FP16 autocast was tested and explicitly rejected** for both codecs -- caused 0.15% (XCodec2) and 1.26% (BiCodec) token drift that compounds at 1M+ hours scale.

### Phase 2: Production Pipeline (Complete)

Built the full encoding pipeline in `codecbench/pipeline/`:

| Module | Purpose |
|--------|---------|
| `config.py` | Centralized config (R2, Supabase, VAD, codec, worker tunables) |
| `r2_client.py` | R2 download/upload, prefix-based video lookup, async VideoPrefetcher |
| `supabase_client.py` | Tables (DDL via psycopg2), atomic claim RPC, heartbeats, shard tracking |
| `vad.py` | Silero-VAD segmentation, 2-30s speech segments with good distribution |
| `encoder.py` | Hot GPU encoder, parallel CUDA streams, both codecs stay warm |
| `shard_packer.py` | Token packing into tar.zst with manifest + per-segment NPZ (uint16) |
| `heartbeat.py` | Daemon thread, 30s heartbeat to Supabase |
| `worker.py` | Main orchestrator: claim->download->extract->VAD->encode->pack->upload loop |
| `cli.py` | CLI: `ingest`, `setup-db`, `run`, `stats`, `bench` commands |

**Verified end-to-end with real data:**
- Downloaded videos from R2 (pt-english, pt-indic buckets)
- Extracted audio with ffmpeg (16kHz mono, EBU R128 normalized)
- VAD segmented (Silero-VAD, 95-97% speech detection, 19-33 segments per video)
- Encoded with both codecs on parallel CUDA streams
- Packed shards and uploaded to R2 `pretrain-encoded` bucket
- Full Supabase lifecycle tracking (4 shards registered, multiple videos DONE)
- Worker CLI tested: auto-claim, prefetch, heartbeat, graceful shutdown

**Supabase schema created and live:**
- `encoding_videos`: 3.3M video registry with lifecycle states, indexes on status/language
- `encoding_workers`: worker registration, heartbeat metrics, error tracking
- `encoding_shards`: shard manifest with video_ids, R2 keys, sizes
- `claim_next_video()` RPC: atomic `FOR UPDATE SKIP LOCKED`
- `release_stale_claims()` RPC: auto-release after 10min timeout

**Custom XCodec2 checkpoint support:** `--custom-ckpt` flag loads `xcodec/nikil_new/indic_step_00198000.ckpt` (5.4GB) from R2.

### Phase 3: Benchmarking & Docker (Complete)

Built `scripts/gpu_benchmark.py` -- per-stage timing benchmark with standalone mode:
- `--standalone` flag: lists videos directly from R2 (no Supabase claims needed)
- Per-stage breakdown: download, ffmpeg extract, VAD, GPU encode, pack, upload
- GPU metrics: VRAM model/peak, utilization %, TFLOPS estimate
- JSON output for cross-GPU comparison

**Docker image built and pushed**: `bharathkumar192/codecbench-pipeline:latest` (18.9GB)
- Base: `nvidia/cuda:12.1.1-runtime-ubuntu22.04` + Python 3.11
- Baked in: PyTorch 2.5.0, XCodec2, Spark-TTS + 0.5B checkpoint, Silero-VAD, all pipeline deps
- `.env` NOT baked in (pass at runtime via `--env-file`)

**A100 80GB PCIe (3 real videos, BS=2, parallel, Docker):**

| Stage | Total (s) | Avg/video | % of total |
|-------|----------|-----------|-----------|
| Download | 6.6 | 2.2 | 3.5% |
| Extract (ffmpeg) | 119.0 | 39.7 | **63.6%** |
| VAD (Silero) | 53.5 | 17.8 | 28.6% |
| Encode (GPU) | 8.0 | 2.7 | **4.3%** |

- Encode-only RTF: **477x**, Overall RTF: **19.1x**, GPU idle: **95.7%**, VRAM peak: 4895 MB
- 1 hour of audio through both codecs in **188 seconds**

**RTX 3060 12GB baseline (5 real videos, BS=2, parallel):**

- Encode-only RTF: **137.5x**, Overall RTF: **13.7x**, GPU idle: **90%**, VRAM peak: 4982 MB

**A100 vs RTX 3060**: 3.5x faster encoding (477x vs 137x RTF), but GPU idle jumps from 90% to 96% because CPU/IO can't feed the faster GPU.

Built `scripts/vastai_benchmark.py` -- Vast.ai orchestrator now uses the Docker image directly:
- No onstart script, no rsync -- everything baked in the image
- `--image` flag to specify custom image, `--sync-local` to overlay local code changes
- Auto-creates instances, waits for SSH, runs benchmark, collects JSON results, destroys instances

### Phase 4: Async Pipeline Optimization (Complete)

Built `codecbench/pipeline/async_pipeline.py` -- 3-stage pipeline that keeps CPU and GPU both busy:

```
[Download Workers] → download_q → [Extract+VAD Workers] → ready_q → [GPU Encoder]
   (2 threads)                       (N threads)                    (main thread)
```
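The stage wiring above can be sketched with `queue` + `threading`; the three callables stand in for the real R2 download, ffmpeg-pipe extract + VAD, and GPU encode steps (all placeholders):

```python
import queue
import threading

def run_pipeline(video_ids, download, extract_vad, encode, n_extract=6):
    """3-stage pipeline sketch: download -> extract+VAD workers -> GPU encode.

    Bounded queues provide backpressure; a sentinel per extract worker
    shuts the pipeline down cleanly. Stage callables are placeholders.
    """
    download_q = queue.Queue(maxsize=4)
    ready_q = queue.Queue(maxsize=4)
    STOP = object()

    def downloader():
        for vid in video_ids:
            download_q.put(download(vid))
        for _ in range(n_extract):          # one sentinel per extract worker
            download_q.put(STOP)

    def extractor():
        while True:
            item = download_q.get()
            if item is STOP:
                ready_q.put(STOP)
                return
            ready_q.put(extract_vad(item))

    threads = [threading.Thread(target=downloader)]
    threads += [threading.Thread(target=extractor) for _ in range(n_extract)]
    for t in threads:
        t.start()

    results, stops = [], 0
    while stops < n_extract:
        item = ready_q.get()
        if item is STOP:
            stops += 1
            continue
        results.append(encode(item))         # GPU encode stays on the main thread
    for t in threads:
        t.join()
    return results
```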

| Optimization | Before | After | Speedup |
|---|---|---|---|
| FFmpeg loudnorm → peak norm | 39.7s/video | 3.8s/video | **10.4x** |
| Disk I/O → pipe extraction | +1-2s overhead | 0s | eliminated |
| Serial stages → parallel workers | 1 video at a time | 6 concurrently | **6x** throughput |
| Single VAD model → per-thread | thread-unsafe crashes | zero failures | fixed |
| File-based VAD → tensor-based | torchaudio.load() | segment_tensor() | no disk read |

**Result: 187s → 50.9s wall time for 1.02 hrs audio = 71.8x pipeline RTF (3.7x faster)**

Key design: `loudnorm` was the hidden killer -- EBU R128 is a 2-pass filter that doubles ffmpeg decode time. Replaced with 0.1ms peak normalization in Python. Combined with N parallel workers, the CPU can feed the GPU fast enough that the bottleneck shifts from "GPU is 96% idle" to "GPU is ~11% active but getting fed continuously".
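The replacement normalization is a one-liner; a pure-Python sketch (the 0.95 target peak and epsilon guard are illustrative assumptions, not the pipeline's exact constants):

```python
def peak_normalize(samples, peak=0.95, eps=1e-8):
    """Single-pass peak normalization, replacing 2-pass EBU R128 loudnorm.

    Scales the waveform so its absolute peak hits `peak`; `eps` guards
    against all-zero input. Values here are illustrative assumptions.
    """
    m = max((abs(s) for s in samples), default=0.0)
    scale = peak / max(m, eps)
    return [s * scale for s in samples]
```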

### Phase 5: Data Pipeline Fixes (Current)

**Fix 1: >6s segment truncation → overlap chunking + center-cut stitching**

XCodec2 was trained on exactly 96,000 samples (6s). From the original repo's `config/dataset/default.yaml`:
`min_audio_length: 96000` with random windowing in training data loader.

Previous `_pad_to_chunk()` silently **cropped** segments >6s — a 25s VAD segment lost 19s of speech.
Now uses overlapping 6s windows (0.2s overlap, 5.8s stride) with deterministic center-cut stitching.

**Overlap chunking**: Segments >6s split into 6s chunks with 0.2s overlap at boundaries. Each chunk
encoded independently (matching 96k training distribution). 3.4% extra compute for cleaner boundary tokens.

**Center-cut stitch rule** (deterministic, in token-space):
- First chunk:  keep `tokens[0 : valid_len - 5]`
- Middle chunk: keep `tokens[5 : valid_len - 5]`
- Last chunk:   keep `tokens[5 : valid_len]`
- Single chunk: keep `tokens[0 : valid_len]`

Where 5 = half-overlap tokens (0.1s × 50 TPS). Drops boundary frames contaminated by Conv1d zero-padding.
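The stitch rule can be sketched in token space (a minimal sketch; `valid_len` is each chunk's token count after trimming to actual audio length):

```python
def stitch_tokens(chunk_tokens, valid_lens, half_overlap=5):
    """Center-cut stitch: drop half_overlap tokens at each interior boundary.

    First chunk keeps [0 : vlen - h], middle chunks [h : vlen - h],
    last chunk [h : vlen], a single chunk [0 : vlen] -- matching the
    rule above with h = 5 tokens (0.1s at 50 TPS).
    """
    n = len(chunk_tokens)
    out = []
    for i, (toks, vlen) in enumerate(zip(chunk_tokens, valid_lens)):
        start = 0 if i == 0 else half_overlap
        end = vlen if i == n - 1 else vlen - half_overlap
        out.extend(toks[start:end])
    return out
```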

**Fix 2: Padding bug** — `320 - (T % 320)` pads a full extra hop of 320 samples when T is already aligned. Fixed to `(320 - T % 320) % 320`.
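A quick check of the fix:

```python
def pad_amount(T, hop=320):
    """Right-padding (in samples) needed to align T to the hop size.

    The buggy form, hop - T % hop, returns a full extra hop (320) when
    T is already aligned; the modulo wrap fixes that case.
    """
    return (hop - T % hop) % hop
```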

**Fix 3: Token trimming** — Original `inference_save_code.py` trims to `int(audio_len / 320)`. Our code now
matches: after encoding, tokens trimmed to actual audio length. Prevents padding tokens in stored data.

**Fix 4: BiCodec batched path correctness** — The pipeline hot path previously encoded the full
XCodec2 batch but only encoded `batch[0]` for BiCodec, then reused that first BiCodec result across
the whole batch. This is a real correctness bug, not the acceptable XCodec2 fp32/cuFFT vs
numpy/fp64 drift discussed below. Downstream dataset builders must treat any pre-fix BiCodec shards
as suspect unless they are regenerated or explicitly revalidated.

| Before | After |
|---|---|
| 25s segment → 6s encoded, 19s lost | 25s → 5 overlapping chunks, stitched to ~1170 tokens |
| Hard cuts at 6s boundaries | 0.2s overlap + center-cut removes boundary contamination |
| `min_segment_s = 2.0` | `min_segment_s = 3.0` |
| 1 extra padding token per chunk | Tokens trimmed to actual audio length |
| BiCodec `batch[0]` reused across the batch | BiCodec now returns one token stream per segment |
| TF32 concern | TF32 kept ON: drift negligible vs FFT algorithm, consistent in Docker |

**fp64 GPU FFT investigation**: Tested and confirmed — the 0.53% XCodec2 token drift is from FFT algorithm
difference (cuFFT Cooley-Tukey vs numpy pocketfft split-radix), not precision. Accept 0.53% as the floor.

### What Needs to Be Done Next

**Completed:**
1. ~~Docker image~~ — `bharathkumar192/codecbench-pipeline:latest` (18.9GB)
2. ~~Cross-GPU benchmarks~~ — L40S 227x, RTX4090 211x, A100 135x pipeline RTF
3. ~~Async pipeline~~ — 3.7x speedup (187s → 50.9s for 1hr audio)
4. ~~Fix segment truncation~~ — proper 6s chunking with lineage tracking

**Production scale-up:**
5. **Dedup fingerprinting** — SHA256 content hash + energy signature per video (inline, ~5ms/video)
6. **Batch near-dedup** — cosine similarity on energy signatures (offline, one-time)
7. **Quality filtering** — SNR estimate per segment, reject below threshold
8. **Bulk ingest** — load all 3.3M videos into Supabase
9. **Multi-worker deployment** — each container claims videos atomically
10. **Global manifest** — JSONL/Parquet index across all shards for training dataloaders
11. **24kHz codecs** — SNAC + WavTokenizer pair (planned, not yet implemented)

### Environment Setup

**Credentials needed** (in `.env`, gitignored):
```
R2_ENDPOINT_URL, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY  -- Cloudflare R2
DATABASE_URL, SUPABASE_ADMIN, URL                         -- Supabase
HF_TOKEN                                                   -- HuggingFace
VASTAI_KEY                                                 -- Vast.ai
```

**R2 Buckets:**
| Bucket | Contents |
|--------|----------|
| `pt-english` | English pretrain videos (402K, 160K hrs) |
| `pt-indic` | Indic pretrain videos (2.9M, 1.29M hrs) |
| `metafiles` | CSV metadata — pretrain CSVs have R2 files, podcast CSVs are metadata-only |
| `xcodec` | Custom XCodec2 checkpoints (nikil_new/) |
| `pretrain-encoded` | Output: encoded token shards + benchmark results |

**Metafiles CSV inventory** (in R2 `metafiles` bucket):
| CSV | Rows | Has R2 files | Notes |
|-----|------|--------------|-------|
| `english_pretrain.csv` | 402K | Yes (pt-english) | Primary English source |
| `indic_pretrain.csv` | 2.9M | Yes (pt-indic) | Primary Indic source |
| `english_podcasts.csv` | 340K | No | Metadata-only, not yet scraped |
| `indic_podcasts.csv` | 189K | No | Metadata-only, not yet scraped |

**Supabase sync** (populates `encoding_videos` table from R2 CSVs):
```bash
# List available CSVs
python -m codecbench.pipeline.cli sync --list

# Ingest pretrain CSVs (videos that actually exist in R2)
python -m codecbench.pipeline.cli sync english_pretrain.csv indic_pretrain.csv

# Controlled test run (3 videos, pack into shard of 3)
docker compose --profile test run --rm pipeline-test run \
  --max-videos=3 --shard-count=3 --language=english --tmp-dir=/tmp/pipeline
```

**External repos** (gitignored, clone at setup):
```bash
git clone https://github.com/SparkAudio/Spark-TTS.git repos/Spark-TTS
huggingface-cli download SparkAudio/Spark-TTS-0.5B --local-dir repos/Spark-TTS-0.5B
```

**Key version pins** (things that break if wrong):
- `torch==2.5.0`, `torchaudio==2.5.0`
- `transformers>=4.45,<4.50` (4.50+ has torchao compat issues)
- `torchao>=0.5,<0.6` (0.6+ needs torch 2.6+, causes `torch.int1` error)
- `xcodec2>=0.1.5`

---

## Phase 6: SFT Data Codec Encoding

### Overview
Process pre-segmented SFT audio shards (FLAC in tar) through XCodec2 to produce
token files for TTS model fine-tuning. BiCodec is excluded. XCodec2 encode path
(chunking, overlap, center-cut stitching, token trimming) is **identical** to
pretraining to maintain token uniformity.

### Datasets (bucket: `finalsftdata`)
| Dataset | Languages | Shards | Notes |
|---|---|---|---|
| final-export | 12 (as,bn,en,gu,hi,kn,ml,mr,or,pa,ta,te) | 4,350 | YouTube pretrain export |
| hifitts2 | en | 622 | HiFi-TTS v2 |
| indicvoices-r | 11 Indic | 27 | IndicVoices-R |
| josh | 7 | 140 | JoshTalks |
| joshdelivery | 5 | 105 | JoshTalks delivery |
| **TOTAL** | | **5,244** | |

### Shard structure (per shard folder)
- `audio.tar` (1-3.5 GB) — pre-segmented FLAC @ 16kHz
- `metadata.parquet` — transcriptions, speaker IDs, quality scores
- `audio_index.parquet` — tar member index
- `manifest.json` — shard manifest with integrity checksums
- `xcodec2_tokens.parquet` ← **NEW: output of this pipeline**

### Pipeline flow
```
R2 audio.tar → extract FLACs in memory → decode 16kHz → XCodec2 encode →
pack xcodec2_tokens.parquet → upload back to same shard folder
```

### CLI commands
```bash
python -m codecbench.pipeline.cli sft-snapshot       # Snapshot R2 shards → Supabase
python -m codecbench.pipeline.cli sft-run             # Start SFT worker
python -m codecbench.pipeline.cli sft-run --max-shards=1 --batch-size=4  # Test 1 shard
python -m codecbench.pipeline.cli sft-stats           # Show processing stats
```

### Benchmark (A100-80GB, single shard, 15000 segments)
- Download: 2.6 GB in 55.7s
- FLAC extraction: 15000 files in 2.3s
- Encoding: 143,617s audio in 1,155.7s → **RTF=124x**
- Upload: 14.3 MB parquet in <1s
- Total wall time: ~20 min per shard

### Supabase tracking
Table `sft_encoding_shards`: atomic `FOR UPDATE SKIP LOCKED` claim, same pattern
as pretraining pipeline. Supports hundreds of GPU workers.

### Docker
```bash
docker compose up sft-worker          # Production
docker compose --profile test up sft-test  # Test 1 shard
```
Uses `Dockerfile.sft` — lean image with XCodec2 only (no BiCodec/Spark-TTS).

---

## Next Agent Handoff

### Current production state
- SFT pipeline is now **XCodec2-only**. BiCodec is fully removed from the SFT path.
- Custom checkpoint is still the same R2 checkpoint:
  - bucket: `xcodec`
  - key: `nikil_new/indic_step_00198000.ckpt`
- Input bucket is `finalsftdata`.
- Prefixes included in SFT processing:
  - `final-export/production/shards/`
  - `hifitts2/`
  - `indicvoices-r/`
  - `josh/`
  - `joshdelivery/`
- Prefix explicitly excluded:
  - `indicvoices/`

### Root-cause of the regression we debugged
- The slowdown was **not** from XCodec2 itself.
- The slowdown was caused by the worker prefetch thread doing:
  - download of shard N+1
  - full FLAC decode of shard N+1
  - while shard N was encoding
- That background decode cut throughput roughly in half.
- Fix implemented:
  - prefetch is now **download-only**
  - decode happens only for the current shard on the main path
  - benchmark mode stops after shard 1 upload, but still confirms shard 2 download is ready

### Local benchmark (patched worker, A100-80GB, custom 198k ckpt)
- Download: `7.0s`
- Decode: `47.4s`
- Encode: `1108.5s`
- Upload: `1.3s`
- Audio: `154269s`
- Encode RTF: `139x`

### Vast 4090 benchmark (patched worker, slim image, custom 198k ckpt)
- Image used: `bharathkumar192/codecbench-sft:latest`
- Image digest:
  - `sha256:7a028c3d8ae96253cbe5be117710e97dc25cc5f254245d85e63e5a95a09e1a58`
- Custom checkpoint download from R2:
  - `5440.7 MB`
- Shard benchmark:
  - Download: `27.9s`
  - Decode: `50.2s`
  - Encode: `1063.0s`
  - Upload: `1.1s`
  - Audio: `152533s`
  - Encode RTF: `143x`
  - Effective per shard: `1114.3s`
- 4090 ETA model from benchmark:
  - `1 GPU`: `1621.3h`
  - `10 GPUs`: `162.1h`
  - `50 GPUs`: `32.4h`
  - `100 GPUs`: `16.2h`
  - `200 GPUs`: `8.1h`

### Token consistency validation
- Local self-check on one local-produced shard:
  - `1000` samples
  - average token match: `99.9312%`
- Local vs Vast 4090 shard:
  - `3000` samples
  - same length: `3000/3000`
  - bit-exact samples: `36.0%`
  - average token match: `99.7616%`
- This passes the accepted threshold (`>99%` average token match).
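The average-token-match metric used above can be computed per sample as below (a minimal sketch of the metric; the actual logic in `scripts/compare_sft_tokens.py`, including its length and bit-exactness checks, is an assumption):

```python
def token_match(a, b):
    """Fraction of positions where two token sequences agree.

    Sequences must be the same length; averaging this over samples
    gives the 'average token match' reported above.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)
```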

### DB state right now
- Supabase table: `sft_encoding_shards`
- Current state after refresh/reset:
  - all `5244` eligible shards are `PENDING`
  - included:
    - `final-export`: `4350`
    - `hifitts2`: `622`
    - `indicvoices-r`: `27`
    - `josh`: `140`
    - `joshdelivery`: `105`
- Commands now wired into repo:

```bash
python -m codecbench.pipeline.cli sft-snapshot
python -m codecbench.pipeline.cli sft-reset
python -m codecbench.pipeline.cli sft-stats
python -m codecbench.pipeline.cli sft-run --benchmark --batch-size=4
```

### Important script wiring done
- `scripts/vastai_benchmark.py`
  - now targets SFT image by default
  - uses `.env` copy to `/app/.env`
  - uses explicit SSH identity file
  - uses `RTX 4090` naming (not `RTX_4090`)
  - runs `sft-run --benchmark`
- `scripts/deploy_fleet.py`
  - now targets `bharathkumar192/codecbench-sft:latest`
  - writes `/app/.env` during `onstart`
  - starts `python -m codecbench.pipeline.cli sft-run`
  - prefers `RTX 4090`, then `RTX 3090`, then `RTX 3090 Ti`
  - uses explicit SSH identity file
  - uses proper env keys instead of broken `-e KEY` names

### How to deploy the first production worker
1. Refresh DB:

```bash
python -m codecbench.pipeline.cli sft-snapshot
python -m codecbench.pipeline.cli sft-reset
python -m codecbench.pipeline.cli sft-stats
```

2. Build/push the slim image if code changed:

```bash
docker build -f Dockerfile.sft -t bharathkumar192/codecbench-sft:latest .
docker push bharathkumar192/codecbench-sft:latest
```

3. Deploy exactly `1` worker first:

```bash
python scripts/deploy_fleet.py --deploy --num-gpus 1 --max-price 0.60
```

4. Monitor that first worker:
- Vast instance health:

```bash
python scripts/deploy_fleet.py --status
```

- Pipeline progress:

```bash
python -m codecbench.pipeline.cli sft-stats
```

- Remote worker log:
  - `ssh -i ~/.ssh/id_ed25519 -p <PORT> root@<HOST> 'tail -50 /tmp/worker.log'`

### Validation plan for the next agent
1. Start with `1x 4090`.
2. Do **not** interrupt the worker.
3. Wait until it produces its first output shard tokens in R2.
4. Download that produced token file and validate it with:

```bash
python scripts/compare_sft_tokens.py \
  --shard-key '<shard_prefix>/' \
  --token-key '<shard_prefix>/xcodec2_tokens.parquet' \
  --limit 3000
```

5. If average token match stays above `99%`, continue scaling.

### Fleet scaling plan for the next agent
1. Keep `1x 4090` alive until first output is validated.
2. Scale to `5`.
3. Monitor until all `5` are online and producing.
4. Scale to `20`.
5. Monitor until all `20` are online and producing.
6. Scale to `50`.
7. Monitor and report ETA/throughput.
8. Scale to `100`.
9. Monitor and report ETA/throughput.
10. Scale to `150`.
11. Wait until all are alive, then report ETA/throughput.
12. Scale to `200`.
13. Enter saturation mode.

### Saturation mode instructions for the next agent
- Every `20 minutes`:
  - run fleet status
  - run SFT stats
  - report current throughput and ETA
- If any machine is stuck, offline, or worker process is gone:
  - destroy it
  - replace it immediately
- GPU priority order for replacements:
  1. `RTX 4090`
  2. `RTX 3090`
  3. `RTX 3090 Ti`
- If not enough GPUs are available:
  - loosen price/network filters gradually
  - still avoid 50-series GPUs
- Keep the fleet running until the dataset is fully complete.

### Monitoring checklist for the next agent
- Confirm worker process exists on each instance.
- Confirm `/tmp/worker.log` is advancing.
- Confirm `sft_encoding_shards` `DONE` count is increasing.
- Confirm ETA trend is stable or improving.
- Replace dead/stuck instances immediately.

---

## Production Run Complete (Mar 14, 2026)

### Final numbers
- **5244 / 5244 shards DONE** (100%)
- **73,720,083 total segments** encoded
- **146,619 hours** of audio processed
- `final-export`: 4,350 shards, **60,705,931 segments**, 109,183h audio
- `hifitts2`: 622 shards, 9,096,078 segments, 26,253h audio
- `indicvoices-r`: 27 shards, 322,676 segments, 894h audio
- `josh`: 140 shards, 2,060,044 segments, 5,923h audio
- `joshdelivery`: 105 shards, 1,535,354 segments, 4,366h audio

### Fleet used
- Peak: ~307 instances (164x RTX 4090, 87x RTX 3090, 26x L40S, etc.)
- Peak cost: ~$91/hr
- Total wall time: ~14h from first deploy to last shard done
- Token validation: 99.7% average match (A100 local vs L40S remote)

### Post-run validation for next agent

Run the full validation script:

```bash
python scripts/validate_sft_output.py
```

This checks:
1. DB has 5244 DONE shards with correct segment counts
2. Every shard has `xcodec2_tokens.parquet` in R2
3. Reports per-dataset segment totals

For deeper verification (downloads + decodes parquets):

```bash
python scripts/validate_sft_output.py --deep 50
```

Expected targets:
- `final-export` segments: ~60.7M
- Total segments: ~73.7M
- All 5244 parquets present in R2

### Fleet teardown

Only after validation passes:

```bash
export VAST_KEY="$VAST_API"
python scripts/deploy_fleet.py --destroy-all
```

Or destroy all instances via the Vast.ai API directly:

```bash
python -c "
import os, requests
from dotenv import load_dotenv
load_dotenv()
api_key = os.environ.get('VAST_API', '')
headers = {'Authorization': f'Bearer {api_key}'}
r = requests.get('https://console.vast.ai/api/v0/instances/', headers=headers, timeout=30)
for inst in r.json().get('instances', []):
    requests.delete(f'https://console.vast.ai/api/v0/instances/{inst[\"id\"]}/', headers=headers, timeout=30)
    print(f'Destroyed {inst[\"id\"]}')
"
```
