# Neucodec Encoding Pipeline — Complete Handoff

## What Was Done

All audio data in Cloudflare R2 bucket `finalsftdata` has been encoded into neucodec speech tokens for finetuning `Scicom-intl/Multilingual-TTS-1.7B-Base` (a Qwen3-1.7B based TTS model).

### Final Stats
- **5,563 shards completed** (100%)
- **154,720 hours of audio** encoded
- **67 billion neucodec tokens** generated
- **12 languages**: en (61.7k hrs), hi (18.7k), te (15.7k), ml (9.6k), ta (8.9k), pa (8.9k), gu (6.3k), bn (5.4k), kn (5.3k), mr (4.6k), or (1.9k), as (1.0k)
- **12 datasets**: final-export, hifitts2, indicvoices, josh, joshdelivery, indicvoices-r, librittsr, globe, ears, vctk, ljspeech, expresso

## What is Neucodec

`neuphonic/neucodec` is a neural audio codec that encodes 16kHz audio into discrete speech tokens:
- **Input**: 16kHz mono audio
- **Output**: 50 tokens/sec, FSQ codes as uint16 (range 0-65535)
- **Model**: 823M params — CNN acoustic encoder (CodecEnc) + wav2vec2-bert semantic model + FSQ quantizer
- **Decode**: tokens reconstruct to 24kHz audio

The TTS model (`Multilingual-TTS-1.7B-Base`) uses neucodec as its speech tokenizer. Training requires all audio pre-encoded to neucodec token sequences.
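The 50 tokens/sec encode rate and 24kHz decode rate imply simple conversions between token count, duration, and decoded sample count. A small helper (illustrative only, not part of the pipeline code):

```python
# Helpers derived from the rates above:
# 50 tokens/sec at encode time, 24 kHz audio at decode time.
TOKEN_RATE_HZ = 50
DECODE_SR = 24_000

def tokens_to_seconds(n_tokens: int) -> float:
    """Audio duration represented by a neucodec token sequence."""
    return n_tokens / TOKEN_RATE_HZ

def tokens_to_decoded_samples(n_tokens: int) -> int:
    """Number of 24 kHz samples produced when decoding n_tokens."""
    return n_tokens * DECODE_SR // TOKEN_RATE_HZ
```

For example, 525 tokens represent 10.5 s of audio and decode to 252,000 samples at 24kHz.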

## Where the Output Lives

### R2 Bucket: `finalsftdata`
Each shard directory now contains an additional `neucodec_tokens.parquet` file alongside the original `audio.tar` and `metadata.parquet`:

```
s3://finalsftdata/<shard_prefix>/
    audio.tar              # original audio (FLAC files)
    metadata.parquet       # original metadata
    neucodec_tokens.parquet  # NEW — neucodec encoded tokens
```
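A download sketch for one shard, assuming the standard boto3 S3 client pointed at the R2 endpoint (the helper names here are illustrative, not taken from worker.py):

```python
import os

def shard_keys(shard_prefix: str) -> dict:
    """Object keys for the three files in a shard directory."""
    return {
        "audio": f"{shard_prefix}/audio.tar",
        "metadata": f"{shard_prefix}/metadata.parquet",
        "tokens": f"{shard_prefix}/neucodec_tokens.parquet",
    }

def download_shard(shard_prefix: str, dest_dir: str) -> None:
    """Fetch a shard's files from R2 via the S3-compatible API."""
    import boto3  # deferred so the key helper stays dependency-free
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT_URL"],
        aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
    )
    for key in shard_keys(shard_prefix).values():
        s3.download_file("finalsftdata", key,
                         os.path.join(dest_dir, os.path.basename(key)))
```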

### Output Parquet Schema (`neucodec_tokens.parquet`)
| Column | Type | Description |
|--------|------|-------------|
| `segment_id` | string | Matches the FLAC filename (without the `.flac` extension) from `audio.tar` |
| `neucodec_tokens` | bytes | Raw bytes of uint16 numpy array — the neucodec codes |
| `token_count` | int | Number of tokens (≈ audio_duration_seconds × 50) |

**To read tokens for a segment:**
```python
import pandas as pd
import numpy as np

df = pd.read_parquet("neucodec_tokens.parquet")
row = df[df.segment_id == "SPEAKER_00_0001_0.00-10.50"].iloc[0]
tokens = np.frombuffer(row.neucodec_tokens, dtype=np.uint16)
# tokens.shape = (525,) for 10.5s audio at 50 tokens/sec
```

**To decode back to audio (for validation):**
```python
from neucodec import NeuCodec
import torch

codec = NeuCodec.from_pretrained("neuphonic/neucodec").eval().cuda()
codes = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).unsqueeze(0).cuda()  # (1, 1, T)
audio = codec.decode_code(codes)  # (1, 1, samples) at 24kHz
```

### Supabase Tracking Tables

**Database**: `postgresql://postgres.exlkkfpymkpqlxulurel:Chibhakaku%402001@aws-0-us-west-2.pooler.supabase.com:6543/postgres`

**`neucodec_shards`** — 5,563 rows, one per shard:
- `id`, `shard_prefix`, `dataset`, `language`, `status` (all "completed")
- `segment_count`, `total_audio_seconds`, `total_tokens`, `encode_wall_seconds`
- `output_key` — R2 path to the neucodec_tokens.parquet

**`neucodec_workers`** — worker state (historical, all exited now)

**`claim_neucodec_shard(p_worker_id)`** — PL/pgSQL function for atomic shard claiming
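A plausible reconstruction of the claim function, based on the `FOR UPDATE SKIP LOCKED` pattern the workers rely on; the deployed version lives in Supabase, and the `'pending'`/`'claimed'` status values here are assumptions (the table currently shows only `"completed"`):

```sql
-- Hypothetical sketch, not the deployed function.
CREATE OR REPLACE FUNCTION claim_neucodec_shard(p_worker_id text)
RETURNS neucodec_shards AS $$
DECLARE
    claimed neucodec_shards;
BEGIN
    SELECT * INTO claimed
    FROM neucodec_shards
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED;  -- concurrent workers never block or double-claim

    IF FOUND THEN
        UPDATE neucodec_shards
        SET status = 'claimed'
        WHERE id = claimed.id;
    END IF;

    RETURN claimed;
END;
$$ LANGUAGE plpgsql;
```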

## R2 Access Credentials

```
R2_ENDPOINT_URL=https://cb908ed13329eb7b186e06ab51bda190.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=c3c9190ae7ff98b10271ea8db6940210
R2_SECRET_ACCESS_KEY=eab9394d02b48a865634105b92c74751ec9a311c56884f7aead5d76476c6b576
R2_BUCKET_SFT_DATA=finalsftdata
```

Full env file: `/home/ubuntu/neucodec/.env`

## How the Encoding Worked

### Architecture
1. **Supabase** — central orchestration DB with shard table + atomic claim function
2. **Cloudflare R2** — source audio (audio.tar per shard) + output (neucodec_tokens.parquet)
3. **Vast.ai GPU fleet** — 200+ rented 4090/3090 instances running worker.py
4. **Docker image** — `bharathkumar192/neucodec-worker:latest` (nvidia/cuda:12.1.1 + torch 2.5.1)

### Worker Pipeline (worker.py)
Each worker process:
1. Claims a shard atomically from Supabase (`FOR UPDATE SKIP LOCKED`)
2. Downloads `audio.tar` + `metadata.parquet` from R2
3. Opens tar, reads each FLAC file with soundfile
4. Extracts mel fbank features **on GPU** (custom torch.fft implementation, 20x faster than CPU numpy)
5. Runs neucodec encode: CodecEnc (CNN) → wav2vec2-bert (semantic, truncated to 17 layers) → FSQ quantizer
6. Chunks audio >30s to avoid OOM on 24GB GPUs
7. Collects all (segment_id, tokens) into a parquet DataFrame
8. Uploads `neucodec_tokens.parquet` back to R2 in the shard directory
9. Updates Supabase with completion stats
10. Prefetches next shard in background thread during encoding (zero download wait)
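The 30s chunking in step 6 can be sketched as follows (illustrative helper, not the actual worker.py code): chunks are encoded independently and their token sequences concatenated.

```python
import numpy as np

SR = 16_000          # neucodec input sample rate
CHUNK_SECONDS = 30   # cap per encode call to avoid OOM on 24GB GPUs

def chunk_audio(wav: np.ndarray, sr: int = SR, chunk_s: int = CHUNK_SECONDS):
    """Split a mono waveform into consecutive chunks of at most chunk_s seconds."""
    step = sr * chunk_s
    return [wav[i:i + step] for i in range(0, len(wav), step)]

def encode_long(wav: np.ndarray, encode_fn) -> np.ndarray:
    """Encode arbitrarily long audio by chunking, then concatenating tokens."""
    return np.concatenate([encode_fn(chunk) for chunk in chunk_audio(wav)])
```

Note that chunked encoding can differ slightly from whole-clip encoding at chunk boundaries; at 30s chunks that effect is small.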

### Key Optimizations Applied
- **Layer truncation**: wav2vec2-bert has 24 layers but neucodec only uses hidden_states[16]. Truncated to 17 layers → 1.83x speedup, bit-identical output.
- **GPU fbank**: Moved mel spectrogram extraction from numpy CPU (per-frame FFT loop) to GPU (batched torch.fft.rfft). 20x faster, eliminated CPU bottleneck. Workers went from 9-26x RTF to 40-123x RTF.
- **Pipelined download**: Background thread downloads shard N+1 while shard N encodes.
- **Short-lived DB connections**: Open→query→close pattern to avoid Supabase pooler exhaustion with 200+ workers.
- **30s audio chunking**: Prevents OOM on 24GB GPUs for long audio clips.
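The GPU fbank speedup came from replacing a per-frame FFT loop with one batched FFT over all frames. The same restructuring, illustrated here with NumPy for portability (the worker did this with `torch.fft.rfft` on GPU; frame/hop sizes are illustrative):

```python
import numpy as np

def frames_of(wav: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Stack overlapping frames into a (num_frames, frame_len) matrix."""
    n = 1 + (len(wav) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return wav[idx]

def spectra_loop(frames: np.ndarray) -> np.ndarray:
    # Slow path: one FFT call per frame (the old CPU bottleneck).
    return np.stack([np.abs(np.fft.rfft(f)) for f in frames])

def spectra_batched(frames: np.ndarray) -> np.ndarray:
    # Fast path: a single vectorized FFT over the whole frame matrix
    # (on GPU this becomes one batched torch.fft.rfft call).
    return np.abs(np.fft.rfft(frames, axis=-1))
```

Both paths produce identical spectra; only the call structure changes, which is what lets the GPU amortize kernel launches across all frames.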

## Files in `/home/ubuntu/neucodec/`

| File | Purpose |
|------|---------|
| `worker.py` | Main encoding worker (30KB) — the core of everything |
| `fleet.py` | Vast.ai fleet management: launch N, status, health, destroy-all |
| `monitor.py` | Background auto-recovery: reclaims stuck shards, restarts dead workers |
| `status.py` | Dashboard showing shard progress + per-worker state |
| `hotswap_workers.py` | SCP new worker.py to all instances + rolling restart |
| `rolling_restart.py` | Batched restart (30 at a time) to avoid DB pool thundering herd |
| `restart_dead_workers.py` | Check all instances, restart only dead ones |
| `spawn_extra_workers.py` | Spawn multiple worker processes per GPU (used before GPU fbank fix) |
| `Dockerfile` | Docker image definition (nvidia/cuda:12.1.1 + torch 2.5.1+cu121) |
| `start_worker.sh` | SCP'd to instances, starts worker with env vars |
| `launch_vast.sh` | Single-instance launcher (legacy, superseded by fleet.py) |
| `.env` | All credentials (R2, Supabase, HF, WandB, Vast) |

## TTS Model Context

The target model for finetuning: **`Scicom-intl/Multilingual-TTS-1.7B-Base`**
- Architecture: Qwen3-1.7B backbone
- Speech tokenizer: neucodec (what we just encoded)
- Token rate: 50 tokens/sec
- Training input: text + neucodec token sequences
- The neucodec_tokens.parquet files are the speech side of the SFT training data

## What's Next

1. **Build SFT dataset**: Pair the neucodec tokens (from `neucodec_tokens.parquet`) with the text transcripts (from `metadata.parquet`) for each segment
2. **Finetune the TTS model**: Load `Scicom-intl/Multilingual-TTS-1.7B-Base`, train on the paired (text, neucodec_tokens) data
3. **Inference**: Model generates neucodec tokens from text → decode with neucodec to get audio
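Step 1 amounts to a join on `segment_id`. A minimal sketch, assuming `metadata.parquet` carries a `text` column with the transcript (the actual column name may differ):

```python
import numpy as np
import pandas as pd

def build_sft_pairs(tokens_df: pd.DataFrame, meta_df: pd.DataFrame) -> pd.DataFrame:
    """Pair each segment's transcript with its decoded uint16 token array."""
    merged = tokens_df.merge(
        meta_df[["segment_id", "text"]], on="segment_id", how="inner"
    )
    merged["tokens"] = merged["neucodec_tokens"].map(
        lambda b: np.frombuffer(b, dtype=np.uint16)
    )
    return merged[["segment_id", "text", "tokens", "token_count"]]
```

An inner join silently drops segments missing from either file; checking `len(result)` against `len(tokens_df)` per shard is a cheap sanity check before training.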

## Vast.ai Fleet (DESTROYED)

All instances have been destroyed. If re-encoding is ever needed:
- Docker image `bharathkumar192/neucodec-worker:latest` is still on Docker Hub
- Supabase tables still have all tracking data
- `fleet.py launch N` can spin up a new fleet
- VAST_KEY: `e9c6879fa4946d2201d790e781ea204fee98ff100dec087847d62b92e044c363`
