---
pid: -1
cwd: /
last_command: |
  cd /home/ubuntu/xcodec2_16khz_indic && echo "=== TRANSCRIPT SAMPLES PER LANGUAGE ===" && for lang in telugu hindi english tamil kannada malayalam assamese odia marathi punjabi gujarati bengali; do
    echo -e "\n--- $lang ---"
    grep "\"language\": \"$lang\"" data/evaluation/evaluation_transcripts.jsonl | head -1 | python3 -c "import sys,json; d=json.loads(sys.stdin.read()); print(f'ID: {d[\"id\"]}'); print(f'Duration: {d[\"duration\"]}s'); print(f'Transcript: {d[\"transcription\"][:80]}...')"
  done
last_exit_code: 141
---
============================================================
  Downloading from FLEURS...
    Loading test split...
  ✅ assamese: 500 samples (1.41 hours)                              

============================================================
📥 Processing: ODIA
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data:  87%|████████▋ | 199M/230M [00:01<00:00, 184MB/s]
$ sleep 60 && tail -80 /home/ubuntu/.cursor/projects/home-ubuntu-xcodec2-16khz-indic/terminals/4.txt
📥 DOWNLOADING EVALUATION DATASETS FOR XCODEC2 INDIC
======================================================================
Output directory: /home/ubuntu/xcodec2_16khz_indic/data/evaluation
Languages: telugu, hindi, english, tamil, kannada, malayalam, assamese, odia, marathi, punjabi, gujarati, bengali
Samples per language: 500


============================================================
📥 Processing: TELUGU
============================================================
  Downloading from FLEURS...
    Loading test split...
    Loading validation split...                                    
  ✅ telugu: 500 samples (1.38 hours)

============================================================
📥 Processing: HINDI
============================================================
  Downloading from FLEURS...
    Loading test split...
    Loading validation split...                                   
  ✅ hindi: 500 samples (1.36 hours)

============================================================
📥 Processing: ENGLISH
============================================================
  Using LibriSpeech test-clean (studio quality)...
    Loading LibriSpeech test-clean...
  ✅ english: 500 samples (0.95 hours)

============================================================
📥 Processing: TAMIL
============================================================
  Downloading from FLEURS...
    Loading test split...
    Loading validation split...                                   
  ✅ tamil: 500 samples (1.41 hours)                                   

============================================================
📥 Processing: KANNADA
============================================================
  Downloading from FLEURS...
    Loading test split...
  ✅ kannada: 500 samples (1.54 hours)                              

============================================================
📥 Processing: MALAYALAM
============================================================
  Downloading from FLEURS...
    Loading test split...
  ✅ malayalam: 500 samples (1.55 hours)                              

============================================================
📥 Processing: ASSAMESE
============================================================
  Downloading from FLEURS...
    Loading test split...
  ✅ assamese: 500 samples (1.41 hours)                              

============================================================
📥 Processing: ODIA
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data: 100%|██████████| 230M/230M [00:01<00:00, 148MB/s]  
Downloading data: 100%|██████████| 550M/550M [00:03<00:00, 175MB/s]  
Downloading data: 100%|██████████| 1.30M/1.30M [00:00<00:00, 5.28MB/s]
Downloading data: 100%|██████████| 460k/460k [00:00<00:00, 2.79MB/s]
Downloading data: 100%|██████████| 1.12M/1.12M [00:00<00:00, 5.01MB/s]
Generating train split: 1081 examples [00:07, 136.39 examples/s]
Generating validation split: 392 examples [00:02, 159.02 examples/s]
Generating test split: 883 examples [00:05, 151.08 examples/s]
  ✅ odia: 500 samples (1.44 hours)                              

============================================================
📥 Processing: MARATHI
============================================================
  Downloading from FLEURS...
    Loading test split...
Downloading data:   0%|          | 0.00/2.20G [00:00<?, ?B/s]
$ sleep 90 && tail -60 /home/ubuntu/.cursor/projects/home-ubuntu-xcodec2-16khz-indic/terminals/4.txt
Downloading data:   0%|          | 0.00/2.20G [00:00<?, ?B/s]
$ ps aux | grep prepare_evaluation || echo "Process might have finished"
ubuntu   1641125  0.0  0.0   9340  5488 ?        Ss   01:18   0:00 /bin/bash -O extglob -c snap=$(command cat <&3) && builtin shopt -s extglob && builtin eval -- "$snap" && { builtin set +u 2>/dev/null || true; builtin export PWD="$(builtin pwd)"; builtin eval "$1" < /dev/null; }; COMMAND_EXIT_CODE=$?; dump_bash_state >&4; builtin exit $COMMAND_EXIT_CODE -- cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && HF_TOKEN=$(grep HF_TOKEN .env | cut -d= -f2) python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500 2>&1
ubuntu   1641142 85.1  3.7 13182644 4641576 ?    Rl   01:18   5:15 python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500
ubuntu   1645058  0.0  0.0   9352  5368 ?        Ss   01:24   0:00 /bin/bash -O extglob -c snap=$(command cat <&3) && builtin shopt -s extglob && builtin eval -- "$snap" && { builtin set +u 2>/dev/null || true; builtin export PWD="$(builtin pwd)"; builtin eval "$1" < /dev/null; }; COMMAND_EXIT_CODE=$?; dump_bash_state >&4; builtin exit $COMMAND_EXIT_CODE -- ps aux | grep prepare_evaluation || echo "Process might have finished"
ubuntu   1645074  0.0  0.0   6612  2560 ?        S    01:24   0:00 grep --color=auto prepare_evaluation
$ sleep 120 && cat /home/ubuntu/.cursor/projects/home-ubuntu-xcodec2-16khz-indic/terminals/4.txt | tail -40
    Loading test split...
Downloading data: 100%|██████████| 2.03G/2.03G [00:10<00:00, 200MB/s] 
Downloading data: 100%|██████████| 279M/279M [00:01<00:00, 147MB/s]  
Downloading data: 100%|██████████| 660M/660M [00:03<00:00, 180MB/s]  
Downloading data: 100%|██████████| 3.48M/3.48M [00:00<00:00, 8.47MB/s]
Downloading data: 100%|██████████| 466k/466k [00:00<00:00, 2.77MB/s]
Downloading data: 100%|██████████| 1.09M/1.09M [00:00<00:00, 4.46MB/s]
Generating train split: 3006 examples [00:23, 126.39 examples/s]
Generating validation split: 402 examples [00:03, 126.13 examples/s]
Generating test split: 920 examples [00:07, 127.96 examples/s]
  ✅ bengali: 500 samples (1.53 hours)                              

======================================================================
📋 CREATING COMBINED EVALUATION MANIFEST
======================================================================

======================================================================
📊 EVALUATION DATA SUMMARY
======================================================================

Language      Samples    Hours Source         
--------------------------------------------------
telugu            500     1.38 fleurs         
hindi             500     1.36 fleurs         
english           500     0.95 librispeech    
tamil             500     1.41 fleurs         
kannada           500     1.54 fleurs         
malayalam         500     1.55 fleurs         
assamese          500     1.41 fleurs         
odia              500     1.44 fleurs         
marathi           500     1.51 fleurs         
punjabi           500     1.35 fleurs         
gujarati          500     1.35 fleurs         
bengali           500     1.53 fleurs         
--------------------------------------------------
TOTAL            6000    16.74

✅ Manifest: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation_manifest.json
✅ TSV: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation.tsv
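
Note on the TOTAL row: the twelve per-language values as printed add up to 16.78, not 16.74. That gap is within what per-row rounding can produce, which suggests the total was summed from unrounded durations. This is an assumption, since the preparation script itself is not shown in this log. A minimal check, with the values copied from the table above:

```python
# Hours per language, copied from the summary table (already rounded to 2 places).
printed_hours = {
    "telugu": 1.38, "hindi": 1.36, "english": 0.95, "tamil": 1.41,
    "kannada": 1.54, "malayalam": 1.55, "assamese": 1.41, "odia": 1.44,
    "marathi": 1.51, "punjabi": 1.35, "gujarati": 1.35, "bengali": 1.53,
}
printed_total = 16.74  # TOTAL row as printed in the log

row_sum = round(sum(printed_hours.values()), 2)

# Each printed value can differ from its unrounded source by at most 0.005,
# so twelve rows can drift from the true total by up to 0.06 in the worst case.
max_rounding_drift = 0.005 * len(printed_hours)
gap = round(abs(row_sum - printed_total), 2)

print(f"sum of printed rows:  {row_sum}")
print(f"gap to printed total: {gap} (rounding allows up to {max_rounding_drift:.2f})")
```

The per-language `total_duration_hours` fields in the `metadata.json` files carry four decimals, so summing those would be the more direct way to confirm the total.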
$ cd /home/ubuntu/xcodec2_16khz_indic && ls -la data/evaluation/*/audio | head -20
data/evaluation/assamese/audio:
total 159408
drwxrwxr-x 2 ubuntu ubuntu  20480 Jan 23 00:52 .
drwxrwxr-x 3 ubuntu ubuntu   4096 Jan 23 00:40 ..
-rw-rw-r-- 1 ubuntu ubuntu 303404 Jan 23 01:18 000000.wav
-rw-rw-r-- 1 ubuntu ubuntu 368684 Jan 23 01:18 000001.wav
-rw-rw-r-- 1 ubuntu ubuntu 266924 Jan 23 01:18 000002.wav
-rw-rw-r-- 1 ubuntu ubuntu 361004 Jan 23 01:18 000003.wav
-rw-rw-r-- 1 ubuntu ubuntu 266924 Jan 23 01:18 000004.wav
-rw-rw-r-- 1 ubuntu ubuntu 236204 Jan 23 01:18 000005.wav
-rw-rw-r-- 1 ubuntu ubuntu 395564 Jan 23 01:18 000006.wav
-rw-rw-r-- 1 ubuntu ubuntu 301484 Jan 23 01:18 000007.wav
-rw-rw-r-- 1 ubuntu ubuntu 355244 Jan 23 01:18 000008.wav
-rw-rw-r-- 1 ubuntu ubuntu 259244 Jan 23 01:18 000009.wav
-rw-rw-r-- 1 ubuntu ubuntu 243884 Jan 23 01:18 000010.wav
-rw-rw-r-- 1 ubuntu ubuntu 265004 Jan 23 01:18 000011.wav
-rw-rw-r-- 1 ubuntu ubuntu 391724 Jan 23 01:18 000012.wav
-rw-rw-r-- 1 ubuntu ubuntu 330284 Jan 23 01:18 000013.wav
-rw-rw-r-- 1 ubuntu ubuntu 207404 Jan 23 01:18 000014.wav
-rw-rw-r-- 1 ubuntu ubuntu 355244 Jan 23 01:18 000015.wav
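
Note: the file sizes in this listing can be turned into rough clip durations. The sketch below assumes 16 kHz, 16-bit, mono PCM with a plain 44-byte WAV header. The sample rate is inferred from the project name and the other values are guesses that the listing does not confirm; the helper name is hypothetical.

```python
WAV_HEADER_BYTES = 44   # assumption: canonical PCM header with no extra chunks
SAMPLE_RATE_HZ = 16_000  # assumption: inferred from the project name
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono audio


def estimate_duration_seconds(file_size_bytes: int) -> float:
    """Estimate clip length from a WAV file size under the assumptions above."""
    payload_bytes = file_size_bytes - WAV_HEADER_BYTES
    return payload_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


# Sizes copied from the assamese listing above.
listed_sizes = {
    "000000.wav": 303404,
    "000001.wav": 368684,
    "000014.wav": 207404,
}

for name, size in listed_sizes.items():
    print(f"{name}: {estimate_duration_seconds(size):.2f} s")
```

Under these assumptions the three files come out at 9.48 s, 11.52 s and 6.48 s. Reading the header with Python's `wave` module would give exact values without relying on the assumed format.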
$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python scripts/data_prep/create_and_upload_eval_shards.py 2>&1
======================================================================
📦 CREATING EVALUATION SHARDS AND UPLOADING TO R2
======================================================================
Evaluation data: /home/ubuntu/xcodec2_16khz_indic/data/evaluation
R2 Bucket: xcodec
R2 Prefix: evaluation/

🔗 Connecting to R2...
  ✅ Connected to bucket: xcodec

📁 Processing: TELUGU
  ✅ Created shard: 500 samples, 152.80 MB
  ☁️  Uploading to R2: evaluation/telugu_eval.tar...
  ✅ Uploaded successfully

📁 Processing: HINDI
  ✅ Created shard: 500 samples, 150.39 MB
  ☁️  Uploading to R2: evaluation/hindi_eval.tar...
  ✅ Uploaded successfully

📁 Processing: ENGLISH
  ✅ Created shard: 500 samples, 105.46 MB
  ☁️  Uploading to R2: evaluation/english_eval.tar...
  ✅ Uploaded successfully

📁 Processing: TAMIL
  ✅ Created shard: 500 samples, 156.50 MB
  ☁️  Uploading to R2: evaluation/tamil_eval.tar...
  ✅ Uploaded successfully

📁 Processing: KANNADA
  ✅ Created shard: 500 samples, 170.39 MB
  ☁️  Uploading to R2: evaluation/kannada_eval.tar...
  ✅ Uploaded successfully

📁 Processing: MALAYALAM
  ✅ Created shard: 500 samples, 171.19 MB
  ☁️  Uploading to R2: evaluation/malayalam_eval.tar...
  ✅ Uploaded successfully

📁 Processing: ASSAMESE
  ✅ Created shard: 500 samples, 156.04 MB
  ☁️  Uploading to R2: evaluation/assamese_eval.tar...
  ✅ Uploaded successfully

📁 Processing: ODIA
  ✅ Created shard: 500 samples, 159.12 MB
  ☁️  Uploading to R2: evaluation/odia_eval.tar...
  ✅ Uploaded successfully

📁 Processing: MARATHI
  ✅ Created shard: 500 samples, 167.05 MB
  ☁️  Uploading to R2: evaluation/marathi_eval.tar...
  ✅ Uploaded successfully

📁 Processing: PUNJABI
  ✅ Created shard: 500 samples, 149.30 MB
  ☁️  Uploading to R2: evaluation/punjabi_eval.tar...
  ✅ Uploaded successfully

📁 Processing: GUJARATI
  ✅ Created shard: 500 samples, 149.36 MB
  ☁️  Uploading to R2: evaluation/gujarati_eval.tar...
  ✅ Uploaded successfully

📁 Processing: BENGALI
  ✅ Created shard: 500 samples, 169.05 MB
  ☁️  Uploading to R2: evaluation/bengali_eval.tar...
  ✅ Uploaded successfully

📋 Uploading evaluation manifest...
  ✅ Manifest uploaded: evaluation/evaluation_manifest.json
  ✅ Index uploaded: evaluation/index.json

======================================================================
📊 UPLOAD SUMMARY
======================================================================

Language      Samples  Size (MB) R2 Key                        
----------------------------------------------------------------------
telugu            500     152.80 evaluation/telugu_eval.tar    
hindi             500     150.39 evaluation/hindi_eval.tar     
english           500     105.46 evaluation/english_eval.tar   
tamil             500     156.50 evaluation/tamil_eval.tar     
kannada           500     170.39 evaluation/kannada_eval.tar   
malayalam         500     171.19 evaluation/malayalam_eval.tar 
assamese          500     156.04 evaluation/assamese_eval.tar  
odia              500     159.12 evaluation/odia_eval.tar      
marathi           500     167.05 evaluation/marathi_eval.tar   
punjabi           500     149.30 evaluation/punjabi_eval.tar   
gujarati          500     149.36 evaluation/gujarati_eval.tar  
bengali           500     169.05 evaluation/bengali_eval.tar   
----------------------------------------------------------------------
TOTAL            6000    1856.65

✅ Uploaded 12 shards to R2
   Bucket: xcodec
   Prefix: evaluation/

🔗 Access shards at:
   - xcodec/evaluation/telugu_eval.tar
   - xcodec/evaluation/hindi_eval.tar
   - xcodec/evaluation/english_eval.tar
   - xcodec/evaluation/tamil_eval.tar
   - xcodec/evaluation/kannada_eval.tar
   - xcodec/evaluation/malayalam_eval.tar
   - xcodec/evaluation/assamese_eval.tar
   - xcodec/evaluation/odia_eval.tar
   - xcodec/evaluation/marathi_eval.tar
   - xcodec/evaluation/punjabi_eval.tar
   - xcodec/evaluation/gujarati_eval.tar
   - xcodec/evaluation/bengali_eval.tar
$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python -c "
import boto3
from botocore.config import Config
from dotenv import load_dotenv
import os

load_dotenv()

s3 = boto3.client(
    's3',
    endpoint_url=os.getenv('R2_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('R2_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('R2_SECRET_ACCESS_KEY'),
    config=Config(signature_version='s3v4', s3={'addressing_style': 'path'})
)

print('📂 Files in xcodec/evaluation/:')
response = s3.list_objects_v2(Bucket='xcodec', Prefix='evaluation/')
for obj in response.get('Contents', []):
    size_mb = obj['Size'] / 1024 / 1024
    print(f'  {obj[\"Key\"]:45} {size_mb:>8.2f} MB')
"
📂 Files in xcodec/evaluation/:
  evaluation/assamese_eval.tar                    156.04 MB
  evaluation/bengali_eval.tar                     169.05 MB
  evaluation/english_eval.tar                     105.46 MB
  evaluation/evaluation_manifest.json               2.85 MB
  evaluation/gujarati_eval.tar                    149.36 MB
  evaluation/hindi_eval.tar                       150.39 MB
  evaluation/index.json                             0.00 MB
  evaluation/kannada_eval.tar                     170.39 MB
  evaluation/malayalam_eval.tar                   171.19 MB
  evaluation/marathi_eval.tar                     167.05 MB
  evaluation/odia_eval.tar                        159.12 MB
  evaluation/punjabi_eval.tar                     149.30 MB
  evaluation/tamil_eval.tar                       156.50 MB
  evaluation/telugu_eval.tar                      152.80 MB
$ cd /home/ubuntu/xcodec2_16khz_indic && head -100 data/evaluation/telugu/metadata.json
{
  "language": "telugu",
  "source": "fleurs",
  "lang_code": "te_in",
  "num_samples": 500,
  "total_duration_hours": 1.3778,
  "samples": [
    {
      "id": "telugu_000000",
      "path": "telugu/audio/000000.wav",
      "duration": 8.28,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో సంబంధం కలిగి ఉన్నాయి ఇంకా వీటిని లగ్జరీ బీచ్ రిసార్ట్ స్ అని పిలుస్తారు"
    },
    {
      "id": "telugu_000001",
      "path": "telugu/audio/000001.wav",
      "duration": 6.72,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "కొన్ని క్రియలు ఆబ్జెక్టుల మధ్య తేడాను గుర్తించడానికి ఇది ఒక ముఖ్యమైన మార్గం"
    },
    {
      "id": "telugu_000002",
      "path": "telugu/audio/000002.wav",
      "duration": 10.08,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "మీరు మీ స్వంతం ఆలోచనలతో కాకుండా ప్రభుత్వాల సలహా కూడా తీసుకోవాలని అనుకోవచ్చు అయితే వారి సలహా వారి పౌరుల కోసం రూపొందించబడింది"
    },
    {
      "id": "telugu_000003",
      "path": "telugu/audio/000003.wav",
      "duration": 5.94,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "ఈ దృశ్యాలు పిరమిడ్లపై ప్రదర్శించబడతాయి మరియు వేరే పిరమిడ్లను వెలిగించబడ్డాయి"
    },
    {
      "id": "telugu_000004",
      "path": "telugu/audio/000004.wav",
      "duration": 10.62,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "పరికరాన్ని గాల్లో కదిలించడం ద్వారా వీడియో గేమ్స్లో యాక్షన్లు మరియు కదలికలను నియంత్రించడానికి ఇది ఆటగాళ్లకు వీలు కల్పిస్తుంది"
    },
    {
      "id": "telugu_000005",
      "path": "telugu/audio/000005.wav",
      "duration": 7.08,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "కాల్షియం పొటాషియం వంటి మూలకాలను లోహాలుగా పరిగణిస్తారు వెండి బంగారం వంటి లోహాలు కూడా ఉన్నాయి"
    },
    {
      "id": "telugu_000006",
      "path": "telugu/audio/000006.wav",
      "duration": 5.4,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "ఆ విధంగా పెన్సిల్ బయటకు వచ్చినప్పుడు చాలా మందికి మంచి స్నేహితుడు"
    },
    {
      "id": "telugu_000007",
      "path": "telugu/audio/000007.wav",
      "duration": 8.4,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "బుధవారం జరిగిన ఈవెంట్కు ముందు carpanedo ఛాంపియన్షిప్లో రెండు వ్యక్తిగత రేసులలో పాల్గొన్నారు"
    },
    {
      "id": "telugu_000008",
      "path": "telugu/audio/000008.wav",
      "duration": 12.12,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "పట్టాలను నిలబెట్టడానికి క్రాస్ టీలు చాలా ముందుగానే ప్రవేశపెట్టబడ్డాయి అయితే పైభాగంలో ఇనుప ఖనిజం ఉంటే ట్రాక్ లు మరింత సమర్థవంతంగా ఉంటాయని క్రమంగా గ్రహించారు"
    },
    {
      "id": "telugu_000009",
      "path": "telugu/audio/000009.wav",
      "duration": 12.18,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
      "transcription": "ఇది బేస్ స్టేషన్ లో డ్యూయల్ రేడియోలు ఉంటే 802.11a 802.11b మరియు 802.11gతో అనుకూలంగా వెనుకకు అనుమతిస్తుంది"
    },
    {
      "id": "telugu_000010",
      "path": "telugu/audio/000010.wav",
$ cd /home/ubuntu/xcodec2_16khz_indic && head -100 data/evaluation/english/metadata.json
{
  "language": "english",
  "source": "librispeech",
  "num_samples": 500,
  "total_duration_hours": 0.9474,
  "samples": [
    {
      "id": "english_000000",
      "path": "english/audio/000000.wav",
      "duration": 3.505,
      "language": "english",
      "source": "librispeech",
      "transcription": "CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS"
    },
    {
      "id": "english_000001",
      "path": "english/audio/000001.wav",
      "duration": 14.225,
      "language": "english",
      "source": "librispeech",
      "transcription": "THE ENGLISH FORWARDED TO THE FRENCH BASKETS OF FLOWERS OF WHICH THEY HAD MADE A PLENTIFUL PROVISION TO GREET THE ARRIVAL OF THE YOUNG PRINCESS THE FRENCH IN RETURN INVITED THE ENGLISH TO A SUPPER WHICH WAS TO BE GIVEN THE NEXT DAY"
    },
    {
      "id": "english_000002",
      "path": "english/audio/000002.wav",
      "duration": 5.025,
      "language": "english",
      "source": "librispeech",
      "transcription": "CONGRATULATIONS WERE POURED IN UPON THE PRINCESS EVERYWHERE DURING HER JOURNEY"
    },
    {
      "id": "english_000003",
      "path": "english/audio/000003.wav",
      "duration": 11.065,
      "language": "english",
      "source": "librispeech",
      "transcription": "SHE TAUGHT HER DAUGHTER THEN BY HER OWN AFFECTION FOR IT THAT LOVE FOR A COUNTRY WHERE THEY HAD BOTH BEEN HOSPITABLY RECEIVED AND WHERE A BRILLIANT FUTURE OPENED BEFORE THEM"
    },
    {
      "id": "english_000004",
      "path": "english/audio/000004.wav",
      "duration": 13.16,
      "language": "english",
      "source": "librispeech",
      "transcription": "THE COUNT HAD THROWN HIMSELF BACK ON HIS SEAT LEANING HIS SHOULDERS AGAINST THE PARTITION OF THE TENT AND REMAINED THUS HIS FACE BURIED IN HIS HANDS WITH HEAVING CHEST AND RESTLESS LIMBS"
    },
    {
      "id": "english_000005",
      "path": "english/audio/000005.wav",
      "duration": 5.85,
      "language": "english",
      "source": "librispeech",
      "transcription": "THIS HAS INDEED BEEN A HARASSING DAY CONTINUED THE YOUNG MAN HIS EYES FIXED UPON HIS FRIEND"
    },
    {
      "id": "english_000006",
      "path": "english/audio/000006.wav",
      "duration": 3.315,
      "language": "english",
      "source": "librispeech",
      "transcription": "YOU WILL BE FRANK WITH ME I ALWAYS AM"
    },
    {
      "id": "english_000007",
      "path": "english/audio/000007.wav",
      "duration": 4.785,
      "language": "english",
      "source": "librispeech",
      "transcription": "CAN YOU IMAGINE WHY BUCKINGHAM HAS BEEN SO VIOLENT I SUSPECT"
    },
    {
      "id": "english_000008",
      "path": "english/audio/000008.wav",
      "duration": 7.28,
      "language": "english",
      "source": "librispeech",
      "transcription": "IT IS YOU WHO ARE MISTAKEN RAOUL I HAVE READ HIS DISTRESS IN HIS EYES IN HIS EVERY GESTURE AND ACTION THE WHOLE DAY"
    },
    {
      "id": "english_000009",
      "path": "english/audio/000009.wav",
      "duration": 3.035,
      "language": "english",
      "source": "librispeech",
      "transcription": "I CAN PERCEIVE LOVE CLEARLY ENOUGH"
    },
    {
      "id": "english_000010",
      "path": "english/audio/000010.wav",
      "duration": 3.195,
      "language": "english",
      "source": "librispeech",
      "transcription": "I AM CONVINCED OF WHAT I SAY SAID THE COUNT"
    },
    {
      "id": "english_000011",
      "path": "english/audio/000011.wav",
      "duration": 2.94,
      "language": "english",
      "source": "librispeech",
$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python -c "
import json
from pathlib import Path

eval_dir = Path('data/evaluation')
output_lines = []

languages = ['telugu', 'hindi', 'english', 'tamil', 'kannada', 'malayalam',
             'assamese', 'odia', 'marathi', 'punjabi', 'gujarati', 'bengali']

for lang in languages:
    metadata_path = eval_dir / lang / 'metadata.json'
    if not metadata_path.exists():
        continue

    with open(metadata_path) as f:
        metadata = json.load(f)

    for sample in metadata.get('samples', []):
        entry = {
            'id': sample['id'],
            'audio_path': f'evaluation/{lang}_eval.tar::{sample[\"id\"]}.wav',  # R2 path
            'local_path': sample['path'],  # local relative path
            'language': sample['language'],
            'duration': sample['duration'],
            'transcription': sample.get('transcription', ''),
            'source': sample.get('source', 'unknown'),
        }
        output_lines.append(json.dumps(entry, ensure_ascii=False))

# Write JSONL
output_path = eval_dir / 'evaluation_transcripts.jsonl'
with open(output_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))

print(f'Created: {output_path}')
print(f'Total entries: {len(output_lines)}')
print(f'File size: {output_path.stat().st_size / 1024:.2f} KB')

# Show sample entries
print('\n--- Sample entries ---')
for lang in ['telugu', 'hindi', 'english']:
    for line in output_lines:
        entry = json.loads(line)
        if entry['language'] == lang:
            print(f'{entry[\"id\"]}: {entry[\"transcription\"][:60]}...')
            break
"
Created: data/evaluation/evaluation_transcripts.jsonl
Total entries: 6000
File size: 2932.68 KB

--- Sample entries ---
telugu_000000: చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో స...
hindi_000000: कुछ अणुओं में अस्थिर केंद्रक होता है जिसका मतलब यह है कि उनम...
english_000000: CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS...
$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python -c "
import boto3
from botocore.config import Config
from dotenv import load_dotenv
import os
from pathlib import Path

load_dotenv()

s3 = boto3.client(
    's3',
    endpoint_url=os.getenv('R2_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('R2_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('R2_SECRET_ACCESS_KEY'),
    config=Config(signature_version='s3v4', s3={'addressing_style': 'path'})
)

# Upload transcripts JSONL
jsonl_path = Path('data/evaluation/evaluation_transcripts.jsonl')
r2_key = 'evaluation/evaluation_transcripts.jsonl'

with open(jsonl_path, 'rb') as f:
    s3.upload_fileobj(f, 'xcodec', r2_key, ExtraArgs={'ContentType': 'application/jsonl'})

print(f'✅ Uploaded: xcodec/{r2_key}')
print(f'   Size: {jsonl_path.stat().st_size / 1024:.2f} KB')

# Also upload the full manifest again (with transcripts)
manifest_path = Path('data/evaluation/evaluation_manifest.json')
r2_key_manifest = 'evaluation/evaluation_manifest.json'

with open(manifest_path, 'rb') as f:
    s3.upload_fileobj(f, 'xcodec', r2_key_manifest, ExtraArgs={'ContentType': 'application/json'})

print(f'✅ Uploaded: xcodec/{r2_key_manifest}')

# List all files in evaluation/
print('\n📂 Updated files in xcodec/evaluation/:')
response = s3.list_objects_v2(Bucket='xcodec', Prefix='evaluation/')
for obj in response.get('Contents', []):
    size_mb = obj['Size'] / 1024 / 1024
    if size_mb < 1:
        size_str = f'{obj[\"Size\"] / 1024:.2f} KB'
    else:
        size_str = f'{size_mb:.2f} MB'
    print(f'  {obj[\"Key\"]:50} {size_str:>12}')
"
✅ Uploaded: xcodec/evaluation/evaluation_transcripts.jsonl
   Size: 2932.68 KB
✅ Uploaded: xcodec/evaluation/evaluation_manifest.json

📂 Updated files in xcodec/evaluation/:
  evaluation/assamese_eval.tar                          156.04 MB
  evaluation/bengali_eval.tar                           169.05 MB
  evaluation/english_eval.tar                           105.46 MB
  evaluation/evaluation_manifest.json                     2.85 MB
  evaluation/evaluation_transcripts.jsonl                 2.86 MB
  evaluation/gujarati_eval.tar                          149.36 MB
  evaluation/hindi_eval.tar                             150.39 MB
  evaluation/index.json                                   1.87 KB
  evaluation/kannada_eval.tar                           170.39 MB
  evaluation/malayalam_eval.tar                         171.19 MB
  evaluation/marathi_eval.tar                           167.05 MB
  evaluation/odia_eval.tar                              159.12 MB
  evaluation/punjabi_eval.tar                           149.30 MB
  evaluation/tamil_eval.tar                             156.50 MB
  evaluation/telugu_eval.tar                            152.80 MB
$ cd /home/ubuntu/xcodec2_16khz_indic && head -5 data/evaluation/evaluation_transcripts.jsonl | python -m json.tool --no-ensure-ascii 2>/dev/null || head -5 data/evaluation/evaluation_transcripts.jsonl
{"id": "telugu_000000", "audio_path": "evaluation/telugu_eval.tar::telugu_000000.wav", "local_path": "telugu/audio/000000.wav", "language": "telugu", "duration": 8.28, "transcription": "చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో సంబంధం కలిగి ఉన్నాయి ఇంకా వీటిని లగ్జరీ బీచ్ రిసార్ట్ స్ అని పిలుస్తారు", "source": "fleurs"}
{"id": "telugu_000001", "audio_path": "evaluation/telugu_eval.tar::telugu_000001.wav", "local_path": "telugu/audio/000001.wav", "language": "telugu", "duration": 6.72, "transcription": "కొన్ని క్రియలు ఆబ్జెక్టుల మధ్య తేడాను గుర్తించడానికి ఇది ఒక ముఖ్యమైన మార్గం", "source": "fleurs"}
{"id": "telugu_000002", "audio_path": "evaluation/telugu_eval.tar::telugu_000002.wav", "local_path": "telugu/audio/000002.wav", "language": "telugu", "duration": 10.08, "transcription": "మీరు మీ స్వంతం ఆలోచనలతో కాకుండా ప్రభుత్వాల సలహా కూడా తీసుకోవాలని అనుకోవచ్చు అయితే వారి సలహా వారి పౌరుల కోసం రూపొందించబడింది", "source": "fleurs"}
{"id": "telugu_000003", "audio_path": "evaluation/telugu_eval.tar::telugu_000003.wav", "local_path": "telugu/audio/000003.wav", "language": "telugu", "duration": 5.94, "transcription": "ఈ దృశ్యాలు పిరమిడ్లపై ప్రదర్శించబడతాయి మరియు వేరే పిరమిడ్లను వెలిగించబడ్డాయి", "source": "fleurs"}
{"id": "telugu_000004", "audio_path": "evaluation/telugu_eval.tar::telugu_000004.wav", "local_path": "telugu/audio/000004.wav", "language": "telugu", "duration": 10.62, "transcription": "పరికరాన్ని గాల్లో కదిలించడం ద్వారా వీడియో గేమ్స్లో యాక్షన్లు మరియు కదలికలను నియంత్రించడానికి ఇది ఆటగాళ్లకు వీలు కల్పిస్తుంది", "source": "fleurs"}
$ cd /home/ubuntu/xcodec2_16khz_indic && echo "=== TRANSCRIPT SAMPLES PER LANGUAGE ===" && for lang in telugu hindi english tamil kannada malayalam assamese odia marathi punjabi gujarati bengali; do
    echo -e "\n--- $lang ---"
    grep "\"language\": \"$lang\"" data/evaluation/evaluation_transcripts.jsonl | head -1 | python3 -c "import sys,json; d=json.loads(sys.stdin.read()); print(f'ID: {d[\"id\"]}'); print(f'Duration: {d[\"duration\"]}s'); print(f'Transcript: {d[\"transcription\"][:80]}...')"
done
=== TRANSCRIPT SAMPLES PER LANGUAGE ===

--- telugu ---
ID: telugu_000000
Duration: 8.28s
Transcript: చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో సంబంధం కలిగి ఉన్నాయి ...

--- hindi ---
ID: hindi_000000
Duration: 9.12s
Transcript: कुछ अणुओं में अस्थिर केंद्रक होता है जिसका मतलब यह है कि उनमें थोड़े या बिना किस...

--- english ---
ID: english_000000
Duration: 3.505s
Transcript: CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS...

--- tamil ---
ID: tamil_000000
Duration: 7.8s
Transcript: இது வேதியியல் ph என அழைக்கப்படுகிறது நீங்கள் சிவப்பு முட்டைக்கோஸ் சாற்றைப் பயன்ப...

--- kannada ---
ID: kannada_000000
Duration: 11.34s
Transcript: ಆದರೆ ನಾಯಕನ ವಿಕೆಟ್ ಕಳೆದುಕೊಂಡ ನಂತರ ಭಾರತ 7 ವಿಕೆಟ್ ಕಳೆದುಕೊಂಡು ಕೇವಲ 36 ರನ್ಗಳಿಗೆ ತನ್ನ ...

--- malayalam ---
ID: malayalam_000000
Duration: 11.4s
Transcript: ഈ നഗരം രാജ്യത്തെ മറ്റ് നഗരങ്ങളിൽ നിന്ന് പൂർണ്ണമായും വ്യത്യസ്തമാണ് കാരണം ഇതിന് ആഫ...

--- assamese ---
ID: assamese_000000
Duration: 9.48s
Transcript: প্ৰতিবেদনটোত ইৰাকৰ কাৰ্যপালিকাৰ বৰ্তমানৰ প্ৰতিটো নীতিৰ অতি বেছি সমালোচনা কৰা হৈছ...

--- odia ---
ID: odia_000000
Duration: 10.86s
Transcript: ଯେଉଁ ଯୁଗରେ ଘଟଣାଗୁଡ଼ିକ ଘଟିଥିଲା ​​ତାହାକୁ ସାଧାରଣତଃ 11ଶ 12ଶ ଏବଂ 13ଶ ଶତାବ୍ଦୀରେ ad 100...

--- marathi ---
ID: marathi_000000
Duration: 8.16s
Transcript: पोलिस अधीक्षक चंद्र शेखर सोलंकी यांनी सांगितले की आरोपी चेहरा झाकून घेऊन कोर्टात...

--- punjabi ---
ID: punjabi_000000
Duration: 7.14s
Transcript: ਪਿਰਾਮਿਡ ਦਾ ਸਾਊਂਡ ਅਤੇ ਲਾਈਟ ਸ਼ੋਅ ਇਸ ਖੇਤਰ ਅੰਦਰ ਬੱਚਿਆਂ ਲਈ ਕਰਨ ਵਾਲੀਆਂ ਸਭ ਤੋਂ ਮਜ਼ੇਦਾਰ ...

--- gujarati ---
ID: gujarati_000000
Duration: 11.16s
Transcript: છોડ ઓક્સિજન બનાવે છે જેનેથી મનુષ્ય શ્વાસ લે છે અને તેઓ કાર્બન-ડાયોક્સાઇડ લે છે જ...

--- bengali ---
ID: bengali_000000
Duration: 10.44s
Transcript: একজন শুধুমাত্র আশ্চর্য হতে পারে এই ভেবে যে যখন নতুন কিছু আসে তখন কী-বোর্ড কীরকম ...
$