# XCodec2 Training Data Pipeline

**Last Updated:** 2026-01-22

This document explains how training data flows from R2 to the model during training.

---

## 📊 Data Flow Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           R2 BUCKET (xcodec)                                │
│  manifests/                                                                  │
│  ├── telugu_16/shard-*.tar     (~720 shards, ~16k hours)                   │
│  ├── hindi_16/shard-*.tar      (~590 shards, ~15k hours)                   │
│  ├── english_16/shard-*.tar    (~598 shards, ~16k hours)                   │
│  └── ... (12 languages total)                                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ On-demand download
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        LOCAL NVMe CACHE (LRU)                               │
│  /nvme/shard_cache/                                                         │
│  ├── manifests/telugu_16/shard-xxx.tar    (most recent)                    │
│  ├── manifests/hindi_16/shard-yyy.tar                                      │
│  ├── manifests/english_16/shard-zzz.tar                                    │
│  └── ... (max 500GB, evicts oldest when full)                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ WebDataset streaming
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA LOADER                                        │
│  1. Sample language (proportional by hours, T=1)                            │
│  2. Pick random shard from that language                                    │
│  3. Check cache → download if missing                                       │
│  4. Stream samples from shard                                               │
│  5. Apply quality filter (already pre-filtered in filelist)                 │
│  6. Random 6-second crop                                                    │
│  7. Extract W2V-BERT features                                               │
│  8. Yield batch                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              MODEL                                           │
│  Encoder → Quantizer → Decoder (+ Discriminator)                            │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## 🧭 Flow Chart (Mixed-Language Batching)

```
Start
  │
  ├─► Check per-language buffers
  │       │
  │       └─► Any buffer below threshold?
  │               │
  │               ├─► Yes ──► Pick shard from that language
  │               │               │
  │               │               ├─► Cache hit? ── Read locally
  │               │               └─► Cache miss? ── Download → Cache
  │               │                       │
  │               │                       └─► Extract samples → Add to buffer
  │               │
  │               └─► No ──► Continue to batch construction
  │
  ├─► Build mixed-language batch
  │       │
  │       └─► For each language (proportional to weight):
  │               │
  │               └─► Take N samples from buffer → Add to batch
  │
  ├─► Shuffle batch
  │
  └─► Yield batch → Repeat
```

### Example Batch Construction with Accumulator (batch_size=16)

The accumulator ensures fractional samples are tracked and ALL languages get representation:

```
Language     Weight   Expected   Accumulator  →  Samples Taken
─────────────────────────────────────────────────────────────────
                      (16×w)     (before→after)
english      14.0%    2.24       0.00 → 0.24      2
telugu       12.7%    2.03       0.00 → 0.03      2
hindi        12.6%    2.02       0.00 → 0.02      2
punjabi      12.4%    1.98       0.00 → 0.98      1 (almost 2!)
malayalam    11.0%    1.76       0.00 → 0.76      1
gujarati      9.6%    1.54       0.00 → 0.54      1
kannada       8.7%    1.39       0.00 → 0.39      1
tamil         7.3%    1.17       0.00 → 0.17      1
bengali       4.2%    0.67       0.00 → 0.67      0 (needs ~1.5 batches)
marathi       3.6%    0.58       0.00 → 0.58      0 (needs ~2 batches)
odia          3.1%    0.50       0.00 → 0.50      0 (needs 2 batches)
assamese      0.9%    0.14       0.00 → 0.14      0 (needs ~7 batches)
                                                  ─────
                                                  11 samples + 5 remaining
```

**After 7 batches**, the assamese accumulator reaches 0.14 × 7 ≈ **0.98**; on the next batch it crosses 1.0 and assamese gets its sample.
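The accumulator logic can be sketched in a few lines of Python. This is a simplified sketch: the rule for filling leftover slots (take them from the largest accumulators, debiting future batches) is an assumption for illustration, not necessarily what the loader does.

```python
import math

def allocate_batch(weights, batch_size, acc):
    """One batch of accumulator-based proportional allocation.

    Each language's accumulator collects its fractional entitlement
    (batch_size * weight); whole samples are taken off the top and the
    remainder carries over to the next batch.
    """
    counts = {}
    for lang, w in weights.items():
        acc[lang] = acc.get(lang, 0.0) + batch_size * w
        take = math.floor(acc[lang])
        acc[lang] -= take
        counts[lang] = take
    # Fill any leftover slots from the largest accumulators
    # (assumed fill rule; "borrowed" samples debit future batches).
    short = batch_size - sum(counts.values())
    if short > 0:
        for lang in sorted(acc, key=acc.get, reverse=True)[:short]:
            counts[lang] += 1
            acc[lang] -= 1.0
    return counts
```

Because each accumulator stays strictly between -1 and 1, every language's cumulative count remains within one sample of `num_batches × batch_size × weight`, which is what makes the proportions exact in the long run.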

### Scaling with Batch Size

```
batch_size=16:   assamese gets 1 sample per ~7 batches
batch_size=32:   assamese gets 1 sample per ~4 batches  
batch_size=64:   assamese gets 1 sample per ~2 batches
batch_size=128:  assamese gets 1 sample per batch (128 × 0.009 = 1.1)
```

**The proportions are EXACT over time, regardless of batch size.**

---

## ✅ How We Know Data Is Used

**Cache is only a storage layer.** Usage is determined by the filelist and sampling:
- **Filelist controls eligibility** (only IDs in `train.tsv` are used).
- **Sampling controls exposure** (language weights + shard sampling).

This means: **we don’t guarantee every cached shard is fully used**, but **every sample that *is* used must be in the filelist**.

---

## 📈 How To Track What Is Used

Recommended lightweight tracking:
- **Shard usage counters** (per shard: how many samples yielded)
- **Sample ID counters** (optional, for auditing)
- **Language counts** (sanity check vs expected proportions)

Practical approach:
- Log counters every N steps from the dataloader.
- Persist to `logs/data_usage.jsonl` for later analysis.
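A minimal tracker along these lines (the class name, flush interval, and JSONL schema here are illustrative, not the actual implementation):

```python
import json
import time
from collections import Counter

class UsageTracker:
    """Counts samples per shard and per language; appends JSONL periodically.

    Sketch only: path and flush interval are illustrative defaults, and the
    target directory is assumed to exist.
    """

    def __init__(self, path="logs/data_usage.jsonl", every=1000):
        self.path, self.every = path, every
        self.shards, self.langs, self.step = Counter(), Counter(), 0

    def record(self, shard, lang):
        self.shards[shard] += 1
        self.langs[lang] += 1
        self.step += 1
        if self.step % self.every == 0:
            self.flush()

    def flush(self):
        # One JSON object per line, so the log is easy to tail and parse later
        with open(self.path, "a") as f:
            f.write(json.dumps({"step": self.step, "ts": time.time(),
                                "shards": dict(self.shards),
                                "languages": dict(self.langs)}) + "\n")
```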

## 🔄 LRU Shard Cache Strategy

### Why NOT Download-Per-Step-And-Delete?

| Approach | Pros | Cons |
|----------|------|------|
| **Delete after each batch** | ✅ Minimal disk usage | ❌ Constant network I/O, slow |
| **LRU Cache** ✅ | ✅ Fast, reuses shards, bounded disk usage | ❌ Needs several hundred GB of NVMe |
| **Download everything** | ✅ Fastest after download | ❌ 70k+ hours far exceeds local disk |

### How LRU Cache Works

```python
from collections import OrderedDict
from pathlib import Path


class ShardCache:
    """
    Least Recently Used (LRU) cache for shards.

    - Keeps most recently used shards on disk
    - Evicts oldest shards when cache is full
    - Tracks usage order with OrderedDict
    """

    def __init__(self, cache_dir, max_size_gb=500):
        self.cache_dir = Path(cache_dir)
        self.max_size = max_size_gb * 1024**3
        self.cache_order = OrderedDict()  # {r2_key: file_size_bytes}

    def get_shard(self, r2_key, estimated_size=2 * 1024**3):
        local_path = self.cache_dir / r2_key

        if local_path.exists():
            # Move to end (most recently used)
            self.cache_order.move_to_end(r2_key)
            return local_path

        # Make room, then download from R2
        self._evict_if_needed(estimated_size)
        self._download(r2_key, local_path)
        self.cache_order[r2_key] = local_path.stat().st_size

        return local_path

    def _current_size(self):
        return sum(self.cache_order.values())

    def _evict_if_needed(self, needed_bytes):
        # Remove oldest shards until we have space
        while self.cache_order and self._current_size() + needed_bytes > self.max_size:
            oldest_key, _ = self.cache_order.popitem(last=False)
            (self.cache_dir / oldest_key).unlink(missing_ok=True)

    def _download(self, r2_key, local_path):
        # Fetch r2_key from the R2 bucket to local_path (implementation elided)
        ...
```

### Cache Behavior During Training

```
Step 1-100:    Download shards as needed (cache warming)
Step 100-1000: Mix of cache hits and new downloads
Step 1000+:    Mostly cache hits, occasional eviction/download

Cache utilization over time:
    
100% ─┬──────────────────────────────────────
      │                    ████████████████████
      │               █████
      │          █████
      │     █████
  0% ─┴─────────────────────────────────────────
      0    100   500   1000  2000  ...  steps
           └─ cache warming ─┘ └─ steady state ─┘
```

---

## ⚖️ Mixed-Language Batches (Proportional, T=1)

**Key Design Decision:** Each batch contains samples from MULTIPLE languages, not just one.

### Why Mixed-Language Batches?

| Approach | Pros | Cons |
|----------|------|------|
| **Single-language batches** | Simple, good cache locality | Gradient bias, language "bursts" |
| **Mixed-language batches** ✅ | Stable gradients, language-agnostic | Slightly more complex |

### How It Works

```python
# Language weights (proportional to filtered hours)
weights = {
    "english":   0.1396,  # 14.0% (10,002h)
    "telugu":    0.1273,  # 12.7% (9,123h)
    "hindi":     0.1264,  # 12.6% (9,054h)
    "punjabi":   0.1245,  # 12.4% (8,917h)
    "malayalam": 0.1095,  # 11.0% (7,842h)
    "gujarati":  0.0960,  # 9.6%  (6,875h)
    "kannada":   0.0874,  # 8.7%  (6,263h)
    "tamil":     0.0725,  # 7.3%  (5,195h)
    "bengali":   0.0416,  # 4.2%  (2,981h)
    "marathi":   0.0360,  # 3.6%  (2,579h)
    "odia":      0.0307,  # 3.1%  (2,199h)
    "assamese":  0.0086,  # 0.9%  (617h)
}

# Each batch of 16 samples contains (approximately):
# - 2 English, 2 Telugu, 2 Hindi, 2 Punjabi
# - 2 Malayalam, 1-2 Gujarati, 1 Kannada, 1 Tamil
# - 1-2 Bengali/Marathi/Odia combined
# - Assamese appears in ~1 of every 7 batches
```

### Buffer-Based Batch Construction

```
┌─────────────────────────────────────────────────────────────────────────┐
│  Per-Language Sample Buffers                                            │
├─────────────────────────────────────────────────────────────────────────┤
│  english:   [████████████████████]  (200 samples max)                  │
│  telugu:    [███████████████████ ]                                     │
│  hindi:     [██████████████████  ]                                     │
│  punjabi:   [█████████████████   ]                                     │
│  malayalam: [███████████████     ]                                     │
│  gujarati:  [█████████████       ]                                     │
│  kannada:   [████████████        ]                                     │
│  tamil:     [██████████          ]                                     │
│  bengali:   [██████              ]                                     │
│  marathi:   [█████               ]                                     │
│  odia:      [████                ]                                     │
│  assamese:  [█                   ]                                     │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Sample proportionally
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Mixed Batch (16 samples)                                               │
│  [en, te, hi, pu, ma, gu, en, te, hi, pu, ma, ka, ta, be, te, hi]      │
└─────────────────────────────────────────────────────────────────────────┘
```

### Refill Strategy

When a language buffer runs low:
1. Pick a shard from that language
2. Download to cache (if not cached)
3. Extract samples and add to buffer
4. Continue batch construction
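A sketch of that refill loop (the `pick_shard` and `extract` helpers and the threshold are stand-ins for the real shard sampler and shard reader):

```python
def refill_buffer(lang, buffers, cache, pick_shard, extract, threshold=50):
    """Refill a language's sample buffer from a (possibly cached) shard.

    Sketch: pick_shard(lang) chooses a shard key for that language,
    cache.get_shard resolves it to a local path (downloading on a miss),
    and extract(path) yields decoded samples from the tar.
    """
    while len(buffers[lang]) < threshold:
        shard_key = pick_shard(lang)              # 1. pick a shard
        local_path = cache.get_shard(shard_key)   # 2. cache hit or download
        buffers[lang].extend(extract(local_path)) # 3. add samples to buffer
    # 4. caller resumes batch construction
```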

### Why T=1 (Proportional)?

- **T=1 (Proportional)**: Languages appear in proportion to data size
  - English/Telugu/Hindi: ~40% of each batch combined
  - Assamese: ~1 sample per ~7 batches
  
- **T<1 (Temperature)**: Would upsample minority languages
  - Good for balanced representation
  - But we want languages with more data to drive more adaptation
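One common convention for temperature scaling, shown here as an assumption about how T would be applied (p_i ∝ h_i^T, so T=1 is exactly proportional and T→0 approaches uniform):

```python
def language_probs(hours, T=1.0):
    """Sampling probabilities from per-language hours.

    Assumed convention: p_i proportional to hours_i ** T.
    T=1 reproduces the proportional weights; T<1 flattens the
    distribution, upsampling low-resource languages.
    """
    scaled = {lang: h ** T for lang, h in hours.items()}
    total = sum(scaled.values())
    return {lang: v / total for lang, v in scaled.items()}
```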

---

## 📁 Filelist Format

Training filelist (`data/filelists/train.tsv`):

```tsv
manifests/telugu_16/shard-abc-000000.tar	VIDEO_ID__SPEAKER_00_0014	telugu	5.33
manifests/telugu_16/shard-abc-000000.tar	VIDEO_ID__SPEAKER_00_0015	telugu	6.12
manifests/hindi_16/shard-def-000000.tar	VIDEO_ID__SPEAKER_01_0003	hindi	4.89
...
```

| Column | Description |
|--------|-------------|
| 1 | R2 shard path |
| 2 | Sample ID (unique within shard) |
| 3 | Language |
| 4 | Duration (seconds) |
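Parsing the filelist into per-shard ID sets, as the loader's eligibility check needs, might look like this (a hypothetical helper matching the column layout above):

```python
import csv
from collections import defaultdict

def load_filelist(tsv_path):
    """Map each R2 shard path to the set of eligible sample IDs.

    Columns follow the table above: shard path, sample ID, language,
    duration in seconds (duration and language ignored here).
    """
    valid_ids = defaultdict(set)
    with open(tsv_path, newline="") as f:
        for shard, sample_id, language, duration in csv.reader(f, delimiter="\t"):
            valid_ids[shard].add(sample_id)
    return valid_ids
```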

---

## 🔧 Configuration

### Cache Settings (`config/dataset/r2_indic.yaml`)

```yaml
dataset:
  cache_dir: /nvme/shard_cache  # Fast local storage
  cache_size_gb: 500.0          # Adjust based on available disk
  num_workers: 8                # Workers per GPU
  prefetch_factor: 4            # Batches to prefetch per worker
```

### Recommended Cache Sizes

| Cluster | NVMe Available | Recommended Cache |
|---------|----------------|-------------------|
| 8x H200 | ~2TB each | 500GB |
| 4x A100 | ~1TB each | 300GB |
| Single GPU | ~500GB | 100GB |

---

## 📊 Expected I/O Patterns

### Network I/O (R2 Downloads)

```
Phase          | Downloads/hour | Bandwidth |
---------------|----------------|-----------|
Cache warming  | ~50-100 shards | ~50 GB/hr |
Steady state   | ~5-10 shards   | ~5 GB/hr  |
```

### Disk I/O (Cache)

```
Phase          | Reads/sec | Writes/sec |
---------------|-----------|------------|
Cache warming  | Low       | High       |
Steady state   | High      | Low        |
```

---

## 🚀 Training Flow

### Step-by-Step

1. **DataLoader requests batch**
   
2. **Sample language** (proportional by hours)
   ```
   Picked: telugu (12.7% probability)
   ```

3. **Select random shard** from that language
   ```
   Selected: manifests/telugu_16/shard-abc-000000.tar
   ```

4. **Check cache**
   ```
   Cache HIT  → Use local file
   Cache MISS → Download from R2, add to cache
   ```

5. **Stream samples from shard**
   ```
   Filter by pre-computed valid sample IDs
   (Quality filtering already done in filelist generation)
   ```

6. **Process each sample**
   ```
   Load audio → Resample to 16kHz → Random 6s crop → Pad → W2V-BERT features
   ```

7. **Yield batch to model**
   ```
   {
       'wav': [B, 96000],      # 6 seconds at 16kHz
       'feats': [B, 1, T, D],  # W2V-BERT features
   }
   ```
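The crop/pad step in (6) can be sketched as follows (a pure-Python sketch on sample lists; the real loader presumably operates on tensors):

```python
import random

TARGET_SR = 16_000
CROP_SECONDS = 6
TARGET_LEN = TARGET_SR * CROP_SECONDS  # 96,000 samples

def crop_or_pad(wav, target_len=TARGET_LEN, rng=random):
    """Random-crop clips longer than 6 s; right-pad shorter ones with zeros."""
    if len(wav) > target_len:
        start = rng.randrange(len(wav) - target_len + 1)
        return wav[start:start + target_len]
    return wav + [0.0] * (target_len - len(wav))
```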

---

## ⚠️ Important Notes

### R2 Rate Limits
- Cloudflare R2 has no egress fees but has request limits
- Batch downloads minimize requests
- Cache reduces repeated downloads

### Cache Warming
- First ~100-500 steps will be slower (downloading)
- After cache warms up, training is I/O-bound by disk, not network
- Optional: run a short dry-run to warm cache:
  ```bash
  python scripts/data_prep/dry_run.py --steps 200
  ```

### Download De-Duplication
- The shard cache prevents multiple workers from downloading the same shard at once.
- If a shard is already downloading, other workers wait and reuse the cached file.
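A minimal version of that de-duplication idea uses one lock per shard key (a sketch, not the actual cache code; `download` stands in for the R2 fetch):

```python
import threading
from collections import defaultdict

# One lock per shard key: the first worker to acquire it downloads;
# the rest block on the lock, then find the shard already cached.
_shard_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()

def get_shard_once(r2_key, cache, download):
    """Return the cached value for r2_key, downloading at most once."""
    with _locks_guard:            # protect the lock table itself
        lock = _shard_locks[r2_key]
    with lock:
        if r2_key not in cache:
            cache[r2_key] = download(r2_key)
        return cache[r2_key]
```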

### Multi-Node Training
- Each node has its own cache
- Shards are randomly sampled, so overlap is minimal
- No cache coherency needed

---

## 📝 Summary

| Question | Answer |
|----------|--------|
| Do we download per step? | No, we cache shards |
| Do we delete after each step? | No, we use LRU eviction |
| How much disk space needed? | ~500GB recommended |
| How fast is training? | Network-bound initially, then disk-bound |
| How are languages sampled? | Proportionally by hours (T=1) |
| **Are batches single-language?** | **No, mixed-language per batch** |
| **Language ratio per batch?** | **Preserved using accumulator (exact over time)** |
| **Does it scale with batch size?** | **Yes, proportions stay the same** |
| **Do minority languages get samples?** | **Yes, accumulator ensures fair representation** |

---

## 🚀 Quick Start (R2 Loader)

```bash
python train.py --config-name indic_default dataset=r2_indic
```

For the live-operable trainer:
```bash
python training/indic_train_16k.py train=indic_16k dataset=r2_indic
```

---

## 📊 Evaluation Data

External high-quality evaluation data (500 samples per language) is stored in R2 under `evaluation/` prefix.

### Sources
- **FLEURS (Google)**: High-quality read speech for Indic languages
- **LibriSpeech test-clean**: Studio audiobook recordings for English

### R2 Structure
```
xcodec/evaluation/
├── telugu_eval.tar      (500 samples, 152.80 MB)
├── hindi_eval.tar       (500 samples, 150.39 MB)
├── english_eval.tar     (500 samples, 105.46 MB)
├── tamil_eval.tar       (500 samples, 156.50 MB)
├── kannada_eval.tar     (500 samples, 170.39 MB)
├── malayalam_eval.tar   (500 samples, 171.19 MB)
├── assamese_eval.tar    (500 samples, 156.04 MB)
├── odia_eval.tar        (500 samples, 159.12 MB)
├── marathi_eval.tar     (500 samples, 167.05 MB)
├── punjabi_eval.tar     (500 samples, 149.30 MB)
├── gujarati_eval.tar    (500 samples, 149.36 MB)
├── bengali_eval.tar     (500 samples, 169.05 MB)
├── evaluation_manifest.json
└── index.json
```

**Total**: 6,000 samples (~16.74 hours, ~1.86 GB)

### Shard Format (WebDataset)
Each tar contains samples in WebDataset format:
```
{sample_id}.wav   # 16kHz mono audio
{sample_id}.json  # {"id", "language", "duration", "transcription", "source"}
```
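Reading such a tar without the `webdataset` library is straightforward with `tarfile` (a sketch that pairs each `{sample_id}.wav` with its `{sample_id}.json`):

```python
import json
import tarfile

def iter_webdataset(tar_path):
    """Yield (sample_id, wav_bytes, meta_dict) from a WebDataset-style tar.

    Assumes each sample stores {sample_id}.wav and {sample_id}.json
    side by side, as described above.
    """
    pending = {}
    with tarfile.open(tar_path) as tf:
        for member in tf:
            if not member.isfile():
                continue
            stem, _, ext = member.name.rpartition(".")
            entry = pending.setdefault(stem, {})
            entry[ext] = tf.extractfile(member).read()
            if "wav" in entry and "json" in entry:
                yield stem, entry["wav"], json.loads(entry["json"])
                del pending[stem]
```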

### Download Evaluation Data
```bash
# Download from R2
python scripts/data_prep/download_from_r2.py --prefix evaluation/ --output data/evaluation_r2/

# Or use local evaluation data
# data/evaluation/  (created by prepare_evaluation_data.py)
```
