---
pid: -1
cwd: /home/ubuntu/xcodec2_16khz_indic
last_command: cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && HF_TOKEN=$(grep HF_TOKEN .env | cut -d= -f2) python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500 2>&1
last_exit_code: 0
---
$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && HF_TOKEN=$(grep HF_TOKEN .env | cut -d= -f2) python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500 2>&1
======================================================================
📥 DOWNLOADING EVALUATION DATASETS FOR XCODEC2 INDIC
======================================================================
Output directory: /home/ubuntu/xcodec2_16khz_indic/data/evaluation
Languages: telugu, hindi, english, tamil, kannada, malayalam, assamese, odia, marathi, punjabi, gujarati, bengali
Samples per language: 500


============================================================
📥 Processing: TELUGU
============================================================
  Downloading from FLEURS...
    Loading test split...
    Loading validation split...
  ✅ telugu: 500 samples (1.38 hours)

============================================================
📥 Processing: HINDI
============================================================
  Downloading from FLEURS...
    Loading test split...
    Loading validation split...
  ✅ hindi: 500 samples (1.36 hours)

============================================================
📥 Processing: ENGLISH
============================================================
  Using LibriSpeech test-clean (studio quality)...
    Loading LibriSpeech test-clean...
  ✅ english: 500 samples (0.95 hours)

============================================================
📥 Processing: TAMIL
============================================================
  Downloading from FLEURS...
    Loading test split...
    Loading validation split...                                   
  ✅ tamil: 500 samples (1.41 hours)                                   

============================================================
📥 Processing: KANNADA
============================================================
  Downloading from FLEURS...
    Loading test split...
  ✅ kannada: 500 samples (1.54 hours)                              

============================================================
📥 Processing: MALAYALAM
============================================================
  Downloading from FLEURS...
    Loading test split...
  ✅ malayalam: 500 samples (1.55 hours)                              

============================================================
📥 Processing: ASSAMESE
============================================================
  Downloading from FLEURS...
    Loading test split...
  ✅ assamese: 500 samples (1.41 hours)                              

============================================================
📥 Processing: ODIA
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data: 100%|██████████| 230M/230M [00:01<00:00, 148MB/s]  
Downloading data: 100%|██████████| 550M/550M [00:03<00:00, 175MB/s]  
Downloading data: 100%|██████████| 1.30M/1.30M [00:00<00:00, 5.28MB/s]
Downloading data: 100%|██████████| 460k/460k [00:00<00:00, 2.79MB/s]
Downloading data: 100%|██████████| 1.12M/1.12M [00:00<00:00, 5.01MB/s]
Generating train split: 1081 examples [00:07, 136.39 examples/s]
Generating validation split: 392 examples [00:02, 159.02 examples/s]
Generating test split: 883 examples [00:05, 151.08 examples/s]
  ✅ odia: 500 samples (1.44 hours)                              

============================================================
📥 Processing: MARATHI
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data: 100%|██████████| 2.20G/2.20G [00:19<00:00, 116MB/s] 
Downloading data: 100%|██████████| 292M/292M [00:03<00:00, 95.6MB/s] 
Downloading data: 100%|██████████| 720M/720M [00:06<00:00, 111MB/s]  
Downloading data: 100%|██████████| 3.85M/3.85M [00:00<00:00, 6.65MB/s]
Downloading data: 100%|██████████| 514k/514k [00:00<00:00, 2.30MB/s]
Downloading data: 100%|██████████| 1.26M/1.26M [00:00<00:00, 5.33MB/s]
Generating train split: 3269 examples [00:26, 122.86 examples/s]
Generating validation split: 443 examples [00:03, 136.35 examples/s]
Generating test split: 1015 examples [00:08, 125.75 examples/s]
  ✅ marathi: 500 samples (1.51 hours)                               

============================================================
📥 Processing: PUNJABI
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data: 100%|██████████| 1.19G/1.19G [00:05<00:00, 198MB/s] 
Downloading data: 100%|██████████| 144M/144M [00:01<00:00, 138MB/s]  
Downloading data: 100%|██████████| 357M/357M [00:01<00:00, 180MB/s]  
Downloading data: 100%|██████████| 2.13M/2.13M [00:00<00:00, 6.68MB/s]
Downloading data: 100%|██████████| 269k/269k [00:00<00:00, 796kB/s]
Downloading data: 100%|██████████| 656k/656k [00:00<00:00, 1.91MB/s]
Generating train split: 1923 examples [00:13, 145.15 examples/s]
Generating validation split: 251 examples [00:01, 175.23 examples/s]
Generating test split: 574 examples [00:03, 155.11 examples/s]
    Loading validation split...
  ✅ punjabi: 500 samples (1.35 hours)

============================================================
📥 Processing: GUJARATI
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data: 100%|██████████| 1.72G/1.72G [00:08<00:00, 194MB/s] 
Downloading data: 100%|██████████| 226M/226M [00:01<00:00, 144MB/s]  
Downloading data: 100%|██████████| 551M/551M [00:10<00:00, 51.1MB/s] 
Downloading data: 100%|██████████| 3.47M/3.47M [00:00<00:00, 6.71MB/s]
Downloading data: 100%|██████████| 475k/475k [00:00<00:00, 2.75MB/s]
Downloading data: 100%|██████████| 1.15M/1.15M [00:00<00:00, 6.05MB/s]
Generating train split: 3145 examples [00:19, 161.36 examples/s]
Generating validation split: 432 examples [00:02, 171.32 examples/s]
Generating test split: 1000 examples [00:05, 167.61 examples/s]
  ✅ gujarati: 500 samples (1.35 hours)                               

============================================================
📥 Processing: BENGALI
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data: 100%|██████████| 2.03G/2.03G [00:10<00:00, 200MB/s] 
Downloading data: 100%|██████████| 279M/279M [00:01<00:00, 147MB/s]  
Downloading data: 100%|██████████| 660M/660M [00:03<00:00, 180MB/s]  
Downloading data: 100%|██████████| 3.48M/3.48M [00:00<00:00, 8.47MB/s]
Downloading data: 100%|██████████| 466k/466k [00:00<00:00, 2.77MB/s]
Downloading data: 100%|██████████| 1.09M/1.09M [00:00<00:00, 4.46MB/s]
Generating train split: 3006 examples [00:23, 126.39 examples/s]
Generating validation split: 402 examples [00:03, 126.13 examples/s]
Generating test split: 920 examples [00:07, 127.96 examples/s]
  ✅ bengali: 500 samples (1.53 hours)                              

======================================================================
📋 CREATING COMBINED EVALUATION MANIFEST
======================================================================

======================================================================
📊 EVALUATION DATA SUMMARY
======================================================================

Language      Samples    Hours Source         
--------------------------------------------------
telugu            500     1.38 fleurs         
hindi             500     1.36 fleurs         
english           500     0.95 librispeech    
tamil             500     1.41 fleurs         
kannada           500     1.54 fleurs         
malayalam         500     1.55 fleurs         
assamese          500     1.41 fleurs         
odia              500     1.44 fleurs         
marathi           500     1.51 fleurs         
punjabi           500     1.35 fleurs         
gujarati          500     1.35 fleurs         
bengali           500     1.53 fleurs         
--------------------------------------------------
TOTAL            6000    16.74

✅ Manifest: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation_manifest.json
✅ TSV: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation.tsv
$