---
pid: -1
cwd: /home/ubuntu/xcodec2_16khz_indic
last_command: cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && HF_TOKEN=$(grep HF_TOKEN .env | cut -d= -f2) python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500 2>&1
last_exit_code: 0
---
$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && HF_TOKEN=$(grep HF_TOKEN .env | cut -d= -f2) python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500 2>&1
======================================================================
📥 DOWNLOADING EVALUATION DATASETS FOR XCODEC2 INDIC
======================================================================
Output directory: /home/ubuntu/xcodec2_16khz_indic/data/evaluation
Languages: telugu, hindi, english, tamil, kannada, malayalam, assamese, odia, marathi, punjabi, gujarati, bengali
Samples per language: 500
============================================================
📥 Processing: TELUGU
============================================================
Downloading from FLEURS...
Loading test split...
Loading validation split...
telugu/validation: 25%|██▌ | 79/311 [00:00<00:00, 250.17it/s
✅ telugu: 500 samples (1.38 hours)
============================================================
📥 Processing: HINDI
============================================================
Downloading from FLEURS...
Loading test split...
Loading validation split...
hindi/validation: 81%|████████ | 193/239 [00:00<00:00, 268.07it/s
✅ hindi: 500 samples (1.36 hours)
============================================================
📥 Processing: ENGLISH
============================================================
Using LibriSpeech test-clean (studio quality)...
Loading LibriSpeech test-clean...
english/librispeech: 22%|██▏ | 571/2620 [00:02<00:09, 208.84
✅ english: 500 samples (0.95 hours)
============================================================
📥 Processing: TAMIL
============================================================
Downloading from FLEURS...
Loading test split...
Loading validation split...
✅ tamil: 500 samples (1.41 hours)
============================================================
📥 Processing: KANNADA
============================================================
Downloading from FLEURS...
Loading test split...
✅ kannada: 500 samples (1.54 hours)
============================================================
📥 Processing: MALAYALAM
============================================================
Downloading from FLEURS...
Loading test split...
✅ malayalam: 500 samples (1.55 hours)
============================================================
📥 Processing: ASSAMESE
============================================================
Downloading from FLEURS...
Loading test split...
✅ assamese: 500 samples (1.41 hours)
============================================================
📥 Processing: ODIA
============================================================
Downloading from FLEURS...
Loading test split...
Downloading data: 100%|██████████| 230M/230M [00:01<00:00, 148MB/s]
Downloading data: 100%|██████████| 550M/550M [00:03<00:00, 175MB/s]
Downloading data: 100%|██████████| 1.30M/1.30M [00:00<00:00, 5.28MB/s]
Downloading data: 100%|██████████| 460k/460k [00:00<00:00, 2.79MB/s]
Downloading data: 100%|██████████| 1.12M/1.12M [00:00<00:00, 5.01MB/s]
Generating train split: 1081 examples [00:07, 136.39 examples/s]
Generating validation split: 392 examples [00:02, 159.02 examples/s]
Generating test split: 883 examples [00:05, 151.08 examples/s]
✅ odia: 500 samples (1.44 hours)
============================================================
📥 Processing: MARATHI
============================================================
Downloading from FLEURS...
Loading test split...
Downloading data: 100%|██████████| 2.20G/2.20G [00:19<00:00, 116MB/s]
Downloading data: 100%|██████████| 292M/292M [00:03<00:00, 95.6MB/s]
Downloading data: 100%|██████████| 720M/720M [00:06<00:00, 111MB/s]
Downloading data: 100%|██████████| 3.85M/3.85M [00:00<00:00, 6.65MB/s]
Downloading data: 100%|██████████| 514k/514k [00:00<00:00, 2.30MB/s]
Downloading data: 100%|██████████| 1.26M/1.26M [00:00<00:00, 5.33MB/s]
Generating train split: 3269 examples [00:26, 122.86 examples/s]
Generating validation split: 443 examples [00:03, 136.35 examples/s]
Generating test split: 1015 examples [00:08, 125.75 examples/s]
✅ marathi: 500 samples (1.51 hours)
============================================================
📥 Processing: PUNJABI
============================================================
Downloading from FLEURS...
Loading test split...
Downloading data: 100%|██████████| 1.19G/1.19G [00:05<00:00, 198MB/s]
Downloading data: 100%|██████████| 144M/144M [00:01<00:00, 138MB/s]
Downloading data: 100%|██████████| 357M/357M [00:01<00:00, 180MB/s]
Downloading data: 100%|██████████| 2.13M/2.13M [00:00<00:00, 6.68MB/s]
Downloading data: 100%|██████████| 269k/269k [00:00<00:00, 796kB/s]
Downloading data: 100%|██████████| 656k/656k [00:00<00:00, 1.91MB/s]
Generating train split: 1923 examples [00:13, 145.15 examples/s]
Generating validation split: 251 examples [00:01, 175.23 examples/s]
Generating test split: 574 examples [00:03, 155.11 examples/s]
Loading validation split...
punjabi/validation: 11%|█ | 27/251 [00:00<00:00, 263.14it/
✅ punjabi: 500 samples (1.35 hours)
============================================================
📥 Processing: GUJARATI
============================================================
Downloading from FLEURS...
Loading test split...
Downloading data: 100%|██████████| 1.72G/1.72G [00:08<00:00, 194MB/s]
Downloading data: 100%|██████████| 226M/226M [00:01<00:00, 144MB/s]
Downloading data: 100%|██████████| 551M/551M [00:10<00:00, 51.1MB/s]
Downloading data: 100%|██████████| 3.47M/3.47M [00:00<00:00, 6.71MB/s]
Downloading data: 100%|██████████| 475k/475k [00:00<00:00, 2.75MB/s]
Downloading data: 100%|██████████| 1.15M/1.15M [00:00<00:00, 6.05MB/s]
Generating train split: 3145 examples [00:19, 161.36 examples/s]
Generating validation split: 432 examples [00:02, 171.32 examples/s]
Generating test split: 1000 examples [00:05, 167.61 examples/s]
✅ gujarati: 500 samples (1.35 hours)
============================================================
📥 Processing: BENGALI
============================================================
Downloading from FLEURS...
Loading test split...
Downloading data: 100%|██████████| 2.03G/2.03G [00:10<00:00, 200MB/s]
Downloading data: 100%|██████████| 279M/279M [00:01<00:00, 147MB/s]
Downloading data: 100%|██████████| 660M/660M [00:03<00:00, 180MB/s]
Downloading data: 100%|██████████| 3.48M/3.48M [00:00<00:00, 8.47MB/s]
Downloading data: 100%|██████████| 466k/466k [00:00<00:00, 2.77MB/s]
Downloading data: 100%|██████████| 1.09M/1.09M [00:00<00:00, 4.46MB/s]
Generating train split: 3006 examples [00:23, 126.39 examples/s]
Generating validation split: 402 examples [00:03, 126.13 examples/s]
Generating test split: 920 examples [00:07, 127.96 examples/s]
✅ bengali: 500 samples (1.53 hours)
======================================================================
📋 CREATING COMBINED EVALUATION MANIFEST
======================================================================
======================================================================
📊 EVALUATION DATA SUMMARY
======================================================================
Language     Samples   Hours   Source
--------------------------------------------------
telugu       500       1.38    fleurs
hindi        500       1.36    fleurs
english      500       0.95    librispeech
tamil        500       1.41    fleurs
kannada      500       1.54    fleurs
malayalam    500       1.55    fleurs
assamese     500       1.41    fleurs
odia         500       1.44    fleurs
marathi      500       1.51    fleurs
punjabi      500       1.35    fleurs
gujarati     500       1.35    fleurs
bengali      500       1.53    fleurs
--------------------------------------------------
TOTAL        6000      16.74
✅ Manifest: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation_manifest.json
✅ TSV: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation.tsv
$
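The summary table above can be sanity-checked by re-adding its rows. The sketch below is not part of `prepare_evaluation_data.py`: the names `ROWS` and `check_summary` are hypothetical, and the figures are copied by hand from the log. It assumes the script prints each language's hours rounded to two decimals but computes the TOTAL from unrounded durations, so the sum of the printed values only has to match the reported total within a rounding tolerance.

```python
# Rows copied from the summary table in the log: (language, samples, hours, source).
ROWS = [
    ("telugu", 500, 1.38, "fleurs"),
    ("hindi", 500, 1.36, "fleurs"),
    ("english", 500, 0.95, "librispeech"),
    ("tamil", 500, 1.41, "fleurs"),
    ("kannada", 500, 1.54, "fleurs"),
    ("malayalam", 500, 1.55, "fleurs"),
    ("assamese", 500, 1.41, "fleurs"),
    ("odia", 500, 1.44, "fleurs"),
    ("marathi", 500, 1.51, "fleurs"),
    ("punjabi", 500, 1.35, "fleurs"),
    ("gujarati", 500, 1.35, "fleurs"),
    ("bengali", 500, 1.53, "fleurs"),
]


def check_summary(rows, reported_samples, reported_hours):
    """Return (total_samples, total_hours, ok) for a printed summary table.

    Assumption: every printed hours value (and the printed total) is rounded
    to two decimals, so each contributes at most 0.005 of rounding error.
    """
    total_samples = sum(samples for _, samples, _, _ in rows)
    total_hours = sum(hours for _, _, hours, _ in rows)
    tolerance = 0.005 * (len(rows) + 1)
    ok = (
        total_samples == reported_samples
        and abs(total_hours - reported_hours) <= tolerance
    )
    return total_samples, total_hours, ok


if __name__ == "__main__":
    samples, hours, ok = check_summary(ROWS, 6000, 16.74)
    print(f"samples={samples} hours={hours:.2f} consistent={ok}")
```

The printed per-language hours add up to 16.78, not the reported 16.74. Under the rounding assumption above, that 0.04 gap is within the 0.065 tolerance, so the table is internally consistent.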