--- pid: -1 cwd: / last_command: | cd /home/ubuntu/xcodec2_16khz_indic && echo "=== TRANSCRIPT SAMPLES PER LANGUAGE ===" && for lang in telugu hindi english tamil kannada malayalam assamese odia marathi punjabi gujarati bengali; do echo -e "\n--- $lang ---" grep "\"language\": \"$lang\"" data/evaluation/evaluation_transcripts.jsonl | head -1 | python3 -c "import sys,json; d=json.loads(sys.stdin.read()); print(f'ID: {d[\"id\"]}'); print(f'Duration: {d[\"duration\"]}s'); print(f'Transcript: {d[\"transcription\"][:80]}...')" done last_exit_code: 141 --- ============================================================ Downloading from FLEURS... Loading test split... ✅ assamese: 500 samples (1.41 hours) ============================================================ 📥 Processing: ODIA ============================================================ Downloading from FLEURS... Loading test split... Downloading data: 87%|████████▋ | 199M/230M [00:01<00:00, 184MB/s]$ sleep 60 && tail -80 /home/ubuntu/.cursor/projects/home-ubuntu-xcodec2-16khz-indic/terminals/4.txt 📥 DOWNLOADING EVALUATION DATASETS FOR XCODEC2 INDIC ====================================================================== Output directory: /home/ubuntu/xcodec2_16khz_indic/data/evaluation Languages: telugu, hindi, english, tamil, kannada, malayalam, assamese, odia, marathi, punjabi, gujarati, bengali Samples per language: 500 ============================================================ 📥 Processing: TELUGU ============================================================ Downloading from FLEURS... Loading test split... Loading validation split... telugu/validation: 8%|▊ | 26/311 [00:00<00:01, 252.45it/s telugu/validation: 17%|█▋ | 52/311 [00:00<00:01, 238.20it/s telugu/validation: 25%|██▌ | 79/311 [00:00<00:00, 250.17it/s ✅ telugu: 500 samples (1.38 hours) ============================================================ 📥 Processing: HINDI ============================================================ Downloading from FLEURS... 
Loading test split...
Loading validation split...
hindi/validation: 81%|████████ | 193/239 [00:00<00:00, 268.07it/s
✅ hindi: 500 samples (1.36 hours)
============================================================
📥 Processing: ENGLISH
============================================================
Using LibriSpeech test-clean (studio quality)...
Loading LibriSpeech test-clean...
english/librispeech: 22%|██▏ | 571/2620 [00:02<00:09, 208.84
✅ english: 500 samples (0.95 hours)
============================================================
📥 Processing: TAMIL
============================================================
Downloading from FLEURS...
Loading test split...
Loading validation split...
✅ tamil: 500 samples (1.41 hours)
============================================================
📥 Processing: KANNADA
============================================================
Downloading from FLEURS...
Loading test split...
✅ kannada: 500 samples (1.54 hours)
============================================================
📥 Processing: MALAYALAM
============================================================
Downloading from FLEURS...
Loading test split...
✅ malayalam: 500 samples (1.55 hours)
============================================================
📥 Processing: ASSAMESE
============================================================
Downloading from FLEURS...
Loading test split...
✅ assamese: 500 samples (1.41 hours)
============================================================
📥 Processing: ODIA
============================================================
Downloading from FLEURS...
Loading test split...
Downloading data: 100%|██████████| 230M/230M [00:01<00:00, 148MB/s]
Downloading data: 100%|██████████| 550M/550M [00:03<00:00, 175MB/s]
Downloading data: 100%|██████████| 1.30M/1.30M [00:00<00:00, 5.28MB/s]
Downloading data: 100%|██████████| 460k/460k [00:00<00:00, 2.79MB/s]
Downloading data: 100%|██████████| 1.12M/1.12M [00:00<00:00, 5.01MB/s]
Generating train split: 1081 examples [00:07, 136.39 examples/s]
Generating validation split: 392 examples [00:02, 159.02 examples/s]
Generating test split: 883 examples [00:05, 151.08 examples/s]
✅ odia: 500 samples (1.44 hours)
============================================================
📥 Processing: MARATHI
============================================================
Downloading from FLEURS...
Loading test split...
Downloading data: 0%| | 0.00/2.20G [00:00/dev/null || true; builtin export PWD="$(builtin pwd)"; builtin eval "$1" < /dev/null; }; COMMAND_EXIT_CODE=$?; dump_bash_state >&4; builtin exit $COMMAND_EXIT_CODE -- cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && HF_TOKEN=$(grep HF_TOKEN .env | cut -d= -f2) python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500 2>&1 ubuntu 1641142 85.1 3.7 13182644 4641576 ? Rl 01:18 5:15 python scripts/data_prep/prepare_evaluation_data.py --samples-per-lang 500 ubuntu 1645058 0.0 0.0 9352 5368 ? Ss 01:24 0:00 /bin/bash -O extglob -c snap=$(command cat <&3) && builtin shopt -s extglob && builtin eval -- "$snap" && { builtin set +u 2>/dev/null || true; builtin export PWD="$(builtin pwd)"; builtin eval "$1" < /dev/null; }; COMMAND_EXIT_CODE=$?; dump_bash_state >&4; builtin exit $COMMAND_EXIT_CODE -- ps aux | grep prepare_evaluation || echo "Process might have finished" ubuntu 1645074 0.0 0.0 6612 2560 ? S 01:24 0:00 grep --color=auto prepare_evaluation $ sleep 120 && cat /home/ubuntu/.cursor/projects/home-ubuntu-xcodec2-16khz-indic/terminals/4.txt | tail -40 Loading test split... 
Downloading data: 100%|██████████| 2.03G/2.03G [00:10<00:00, 200MB/s]
Downloading data: 100%|██████████| 279M/279M [00:01<00:00, 147MB/s]
Downloading data: 100%|██████████| 660M/660M [00:03<00:00, 180MB/s]
Downloading data: 100%|██████████| 3.48M/3.48M [00:00<00:00, 8.47MB/s]
Downloading data: 100%|██████████| 466k/466k [00:00<00:00, 2.77MB/s]
Downloading data: 100%|██████████| 1.09M/1.09M [00:00<00:00, 4.46MB/s]
Generating train split: 3006 examples [00:23, 126.39 examples/s]
Generating validation split: 402 examples [00:03, 126.13 examples/s]
Generating test split: 920 examples [00:07, 127.96 examples/s]
✅ bengali: 500 samples (1.53 hours)

======================================================================
📋 CREATING COMBINED EVALUATION MANIFEST
======================================================================

======================================================================
📊 EVALUATION DATA SUMMARY
======================================================================
Language      Samples    Hours  Source
--------------------------------------------------
telugu            500     1.38  fleurs
hindi             500     1.36  fleurs
english           500     0.95  librispeech
tamil             500     1.41  fleurs
kannada           500     1.54  fleurs
malayalam         500     1.55  fleurs
assamese          500     1.41  fleurs
odia              500     1.44  fleurs
marathi           500     1.51  fleurs
punjabi           500     1.35  fleurs
gujarati          500     1.35  fleurs
bengali           500     1.53  fleurs
--------------------------------------------------
TOTAL            6000    16.74

✅ Manifest: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation_manifest.json
✅ TSV: /home/ubuntu/xcodec2_16khz_indic/data/evaluation/evaluation.tsv

$ cd /home/ubuntu/xcodec2_16khz_indic && ls -la data/evaluation/*/audio | head -20
data/evaluation/assamese/audio:
total 159408
drwxrwxr-x 2 ubuntu ubuntu  20480 Jan 23 00:52 .
drwxrwxr-x 3 ubuntu ubuntu   4096 Jan 23 00:40 ..
-rw-rw-r-- 1 ubuntu ubuntu 303404 Jan 23 01:18 000000.wav
-rw-rw-r-- 1 ubuntu ubuntu 368684 Jan 23 01:18 000001.wav
-rw-rw-r-- 1 ubuntu ubuntu 266924 Jan 23 01:18 000002.wav
-rw-rw-r-- 1 ubuntu ubuntu 361004 Jan 23 01:18 000003.wav
-rw-rw-r-- 1 ubuntu ubuntu 266924 Jan 23 01:18 000004.wav
-rw-rw-r-- 1 ubuntu ubuntu 236204 Jan 23 01:18 000005.wav
-rw-rw-r-- 1 ubuntu ubuntu 395564 Jan 23 01:18 000006.wav
-rw-rw-r-- 1 ubuntu ubuntu 301484 Jan 23 01:18 000007.wav
-rw-rw-r-- 1 ubuntu ubuntu 355244 Jan 23 01:18 000008.wav
-rw-rw-r-- 1 ubuntu ubuntu 259244 Jan 23 01:18 000009.wav
-rw-rw-r-- 1 ubuntu ubuntu 243884 Jan 23 01:18 000010.wav
-rw-rw-r-- 1 ubuntu ubuntu 265004 Jan 23 01:18 000011.wav
-rw-rw-r-- 1 ubuntu ubuntu 391724 Jan 23 01:18 000012.wav
-rw-rw-r-- 1 ubuntu ubuntu 330284 Jan 23 01:18 000013.wav
-rw-rw-r-- 1 ubuntu ubuntu 207404 Jan 23 01:18 000014.wav
-rw-rw-r-- 1 ubuntu ubuntu 355244 Jan 23 01:18 000015.wav

$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python scripts/data_prep/create_and_upload_eval_shards.py 2>&1
======================================================================
📦 CREATING EVALUATION SHARDS AND UPLOADING TO R2
======================================================================
Evaluation data: /home/ubuntu/xcodec2_16khz_indic/data/evaluation
R2 Bucket: xcodec
R2 Prefix: evaluation/

🔗 Connecting to R2...
✅ Connected to bucket: xcodec

📁 Processing: TELUGU
Creating shard for telugu: 41%|████▏ | 207/500 [00:00<00:00, 206
Creating shard for telugu: 83%|████████▎ | 414/500 [00:00<00:00, 205
✅ Created shard: 500 samples, 152.80 MB
☁️ Uploading to R2: evaluation/telugu_eval.tar...
✅ Uploaded successfully

📁 Processing: HINDI
Creating shard for hindi: 42%|████▏ | 211/500 [00:00<00:00, 2097
Creating shard for hindi: 88%|████████▊ | 441/500 [00:00<00:00, 2208
✅ Created shard: 500 samples, 150.39 MB
☁️ Uploading to R2: evaluation/hindi_eval.tar...
✅ Uploaded successfully

📁 Processing: ENGLISH
Creating shard for english: 53%|█████▎ | 265/500 [00:00<00:00, 26
✅ Created shard: 500 samples, 105.46 MB
☁️ Uploading to R2: evaluation/english_eval.tar...
✅ Uploaded successfully

📁 Processing: TAMIL
Creating shard for tamil: 39%|███▉ | 196/500 [00:00<00:00, 1952
Creating shard for tamil: 82%|████████▏ | 412/500 [00:00<00:00, 2068
✅ Created shard: 500 samples, 156.50 MB
☁️ Uploading to R2: evaluation/tamil_eval.tar...
✅ Uploaded successfully

📁 Processing: KANNADA
Creating shard for kannada: 39%|███▉ | 197/500 [00:00<00:00, 19
Creating shard for kannada: 81%|████████ | 405/500 [00:00<00:00, 20
✅ Created shard: 500 samples, 170.39 MB
☁️ Uploading to R2: evaluation/kannada_eval.tar...
✅ Uploaded successfully

📁 Processing: MALAYALAM
Creating shard for malayalam: 41%|████ | 206/500 [00:00<00:00,
Creating shard for malayalam: 83%|████████▎ | 416/500 [00:00<00:00,
✅ Created shard: 500 samples, 171.19 MB
☁️ Uploading to R2: evaluation/malayalam_eval.tar...
✅ Uploaded successfully

📁 Processing: ASSAMESE
Creating shard for assamese: 45%|████▍ | 224/500 [00:00<00:00, 2
Creating shard for assamese: 90%|████████▉ | 448/500 [00:00<00:00, 2
✅ Created shard: 500 samples, 156.04 MB
☁️ Uploading to R2: evaluation/assamese_eval.tar...
✅ Uploaded successfully

📁 Processing: ODIA
Creating shard for odia: 43%|████▎ | 214/500 [00:00<00:00, 2139.
Creating shard for odia: 86%|████████▌ | 431/500 [00:00<00:00, 2155.
✅ Created shard: 500 samples, 159.12 MB
☁️ Uploading to R2: evaluation/odia_eval.tar...
✅ Uploaded successfully

📁 Processing: MARATHI
Creating shard for marathi: 42%|████▏ | 208/500 [00:00<00:00, 20
Creating shard for marathi: 85%|████████▌ | 425/500 [00:00<00:00, 21
✅ Created shard: 500 samples, 167.05 MB
☁️ Uploading to R2: evaluation/marathi_eval.tar...
✅ Uploaded successfully

📁 Processing: PUNJABI
Creating shard for punjabi: 46%|████▌ | 228/500 [00:00<00:00, 22
Creating shard for punjabi: 94%|█████████▎| 468/500 [00:00<00:00, 23
✅ Created shard: 500 samples, 149.30 MB
☁️ Uploading to R2: evaluation/punjabi_eval.tar...
✅ Uploaded successfully

📁 Processing: GUJARATI
Creating shard for gujarati: 44%|████▍ | 222/500 [00:00<00:00, 2
Creating shard for gujarati: 90%|█████████ | 451/500 [00:00<00:00, 2
✅ Created shard: 500 samples, 149.36 MB
☁️ Uploading to R2: evaluation/gujarati_eval.tar...
✅ Uploaded successfully

📁 Processing: BENGALI
Creating shard for bengali: 43%|████▎ | 214/500 [00:00<00:00, 21
Creating shard for bengali: 86%|████████▌ | 428/500 [00:00<00:00, 21
✅ Created shard: 500 samples, 169.05 MB
☁️ Uploading to R2: evaluation/bengali_eval.tar...
✅ Uploaded successfully

📋 Uploading evaluation manifest...
✅ Manifest uploaded: evaluation/evaluation_manifest.json
✅ Index uploaded: evaluation/index.json

======================================================================
📊 UPLOAD SUMMARY
======================================================================
Language      Samples   Size (MB)  R2 Key
----------------------------------------------------------------------
telugu            500      152.80  evaluation/telugu_eval.tar
hindi             500      150.39  evaluation/hindi_eval.tar
english           500      105.46  evaluation/english_eval.tar
tamil             500      156.50  evaluation/tamil_eval.tar
kannada           500      170.39  evaluation/kannada_eval.tar
malayalam         500      171.19  evaluation/malayalam_eval.tar
assamese          500      156.04  evaluation/assamese_eval.tar
odia              500      159.12  evaluation/odia_eval.tar
marathi           500      167.05  evaluation/marathi_eval.tar
punjabi           500      149.30  evaluation/punjabi_eval.tar
gujarati          500      149.36  evaluation/gujarati_eval.tar
bengali           500      169.05  evaluation/bengali_eval.tar
----------------------------------------------------------------------
TOTAL            6000     1856.65

✅ Uploaded 12 shards to R2
Bucket: xcodec
Prefix: evaluation/

🔗 Access shards at:
- xcodec/evaluation/telugu_eval.tar
- xcodec/evaluation/hindi_eval.tar
- xcodec/evaluation/english_eval.tar
- xcodec/evaluation/tamil_eval.tar
- xcodec/evaluation/kannada_eval.tar
- xcodec/evaluation/malayalam_eval.tar
- xcodec/evaluation/assamese_eval.tar
- xcodec/evaluation/odia_eval.tar
- xcodec/evaluation/marathi_eval.tar
- xcodec/evaluation/punjabi_eval.tar
- xcodec/evaluation/gujarati_eval.tar
- xcodec/evaluation/bengali_eval.tar

$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python -c "
import boto3
from botocore.config import Config
from dotenv import load_dotenv
import os

load_dotenv()

s3 = boto3.client(
    's3',
    endpoint_url=os.getenv('R2_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('R2_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('R2_SECRET_ACCESS_KEY'),
    config=Config(signature_version='s3v4', s3={'addressing_style': 'path'})
)

print('📂 Files in xcodec/evaluation/:')
response = s3.list_objects_v2(Bucket='xcodec', Prefix='evaluation/')
for obj in response.get('Contents', []):
    size_mb = obj['Size'] / 1024 / 1024
    print(f'  {obj[\"Key\"]:45} {size_mb:>8.2f} MB')
"
📂 Files in xcodec/evaluation/:
  evaluation/assamese_eval.tar                    156.04 MB
  evaluation/bengali_eval.tar                     169.05 MB
  evaluation/english_eval.tar                     105.46 MB
  evaluation/evaluation_manifest.json               2.85 MB
  evaluation/gujarati_eval.tar                    149.36 MB
  evaluation/hindi_eval.tar                       150.39 MB
  evaluation/index.json                             0.00 MB
  evaluation/kannada_eval.tar                     170.39 MB
  evaluation/malayalam_eval.tar                   171.19 MB
  evaluation/marathi_eval.tar                     167.05 MB
  evaluation/odia_eval.tar                        159.12 MB
  evaluation/punjabi_eval.tar                     149.30 MB
  evaluation/tamil_eval.tar                       156.50 MB
  evaluation/telugu_eval.tar                      152.80 MB

$ cd /home/ubuntu/xcodec2_16khz_indic && head -100 data/evaluation/telugu/metadata.json
{
  "language": "telugu",
  "source": "fleurs",
  "lang_code": "te_in",
  "num_samples": 500,
  "total_duration_hours": 1.3778,
  "samples": [
    {
      "id": "telugu_000000",
      "path": "telugu/audio/000000.wav",
      "duration": 8.28,
      "language": "telugu",
      "source": "fleurs",
      "split": "test",
"transcription": "చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో సంబంధం కలిగి ఉన్నాయి ఇంకా వీటిని లగ్జరీ బీచ్ రిసార్ట్ స్ అని పిలుస్తారు" }, { "id": "telugu_000001", "path": "telugu/audio/000001.wav", "duration": 6.72, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "కొన్ని క్రియలు ఆబ్జెక్టుల మధ్య తేడాను గుర్తించడానికి ఇది ఒక ముఖ్యమైన మార్గం" }, { "id": "telugu_000002", "path": "telugu/audio/000002.wav", "duration": 10.08, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "మీరు మీ స్వంతం ఆలోచనలతో కాకుండా ప్రభుత్వాల సలహా కూడా తీసుకోవాలని అనుకోవచ్చు అయితే వారి సలహా వారి పౌరుల కోసం రూపొందించబడింది" }, { "id": "telugu_000003", "path": "telugu/audio/000003.wav", "duration": 5.94, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "ఈ దృశ్యాలు పిరమిడ్లపై ప్రదర్శించబడతాయి మరియు వేరే పిరమిడ్లను వెలిగించబడ్డాయి" }, { "id": "telugu_000004", "path": "telugu/audio/000004.wav", "duration": 10.62, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "పరికరాన్ని గాల్లో కదిలించడం ద్వారా వీడియో గేమ్స్లో యాక్షన్లు మరియు కదలికలను నియంత్రించడానికి ఇది ఆటగాళ్లకు వీలు కల్పిస్తుంది" }, { "id": "telugu_000005", "path": "telugu/audio/000005.wav", "duration": 7.08, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "కాల్షియం పొటాషియం వంటి మూలకాలను లోహాలుగా పరిగణిస్తారు వెండి బంగారం వంటి లోహాలు కూడా ఉన్నాయి" }, { "id": "telugu_000006", "path": "telugu/audio/000006.wav", "duration": 5.4, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "ఆ విధంగా పెన్సిల్ బయటకు వచ్చినప్పుడు చాలా మందికి మంచి స్నేహితుడు" }, { "id": "telugu_000007", "path": "telugu/audio/000007.wav", "duration": 8.4, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "బుధవారం జరిగిన ఈవెంట్కు ముందు carpanedo ఛాంపియన్షిప్లో రెండు వ్యక్తిగత రేసులలో పాల్గొన్నారు" }, { "id": "telugu_000008", "path": "telugu/audio/000008.wav", "duration": 12.12, 
"language": "telugu", "source": "fleurs", "split": "test", "transcription": "పట్టాలను నిలబెట్టడానికి క్రాస్ టీలు చాలా ముందుగానే ప్రవేశపెట్టబడ్డాయి అయితే పైభాగంలో ఇనుప ఖనిజం ఉంటే ట్రాక్ లు మరింత సమర్థవంతంగా ఉంటాయని క్రమంగా గ్రహించారు" }, { "id": "telugu_000009", "path": "telugu/audio/000009.wav", "duration": 12.18, "language": "telugu", "source": "fleurs", "split": "test", "transcription": "ఇది బేస్ స్టేషన్ లో డ్యూయల్ రేడియోలు ఉంటే 802.11a 802.11b మరియు 802.11gతో అనుకూలంగా వెనుకకు అనుమతిస్తుంది" }, { "id": "telugu_000010", "path": "telugu/audio/000010.wav", $ cd /home/ubuntu/xcodec2_16khz_indic && head -100 data/evaluation/english/metadata.json { "language": "english", "source": "librispeech", "num_samples": 500, "total_duration_hours": 0.9474, "samples": [ { "id": "english_000000", "path": "english/audio/000000.wav", "duration": 3.505, "language": "english", "source": "librispeech", "transcription": "CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS" }, { "id": "english_000001", "path": "english/audio/000001.wav", "duration": 14.225, "language": "english", "source": "librispeech", "transcription": "THE ENGLISH FORWARDED TO THE FRENCH BASKETS OF FLOWERS OF WHICH THEY HAD MADE A PLENTIFUL PROVISION TO GREET THE ARRIVAL OF THE YOUNG PRINCESS THE FRENCH IN RETURN INVITED THE ENGLISH TO A SUPPER WHICH WAS TO BE GIVEN THE NEXT DAY" }, { "id": "english_000002", "path": "english/audio/000002.wav", "duration": 5.025, "language": "english", "source": "librispeech", "transcription": "CONGRATULATIONS WERE POURED IN UPON THE PRINCESS EVERYWHERE DURING HER JOURNEY" }, { "id": "english_000003", "path": "english/audio/000003.wav", "duration": 11.065, "language": "english", "source": "librispeech", "transcription": "SHE TAUGHT HER DAUGHTER THEN BY HER OWN AFFECTION FOR IT THAT LOVE FOR A COUNTRY WHERE THEY HAD BOTH BEEN HOSPITABLY RECEIVED AND WHERE A BRILLIANT FUTURE OPENED BEFORE THEM" }, { "id": "english_000004", "path": "english/audio/000004.wav", "duration": 13.16, "language": 
"english",
      "source": "librispeech",
      "transcription": "THE COUNT HAD THROWN HIMSELF BACK ON HIS SEAT LEANING HIS SHOULDERS AGAINST THE PARTITION OF THE TENT AND REMAINED THUS HIS FACE BURIED IN HIS HANDS WITH HEAVING CHEST AND RESTLESS LIMBS"
    },
    {
      "id": "english_000005",
      "path": "english/audio/000005.wav",
      "duration": 5.85,
      "language": "english",
      "source": "librispeech",
      "transcription": "THIS HAS INDEED BEEN A HARASSING DAY CONTINUED THE YOUNG MAN HIS EYES FIXED UPON HIS FRIEND"
    },
    {
      "id": "english_000006",
      "path": "english/audio/000006.wav",
      "duration": 3.315,
      "language": "english",
      "source": "librispeech",
      "transcription": "YOU WILL BE FRANK WITH ME I ALWAYS AM"
    },
    {
      "id": "english_000007",
      "path": "english/audio/000007.wav",
      "duration": 4.785,
      "language": "english",
      "source": "librispeech",
      "transcription": "CAN YOU IMAGINE WHY BUCKINGHAM HAS BEEN SO VIOLENT I SUSPECT"
    },
    {
      "id": "english_000008",
      "path": "english/audio/000008.wav",
      "duration": 7.28,
      "language": "english",
      "source": "librispeech",
      "transcription": "IT IS YOU WHO ARE MISTAKEN RAOUL I HAVE READ HIS DISTRESS IN HIS EYES IN HIS EVERY GESTURE AND ACTION THE WHOLE DAY"
    },
    {
      "id": "english_000009",
      "path": "english/audio/000009.wav",
      "duration": 3.035,
      "language": "english",
      "source": "librispeech",
      "transcription": "I CAN PERCEIVE LOVE CLEARLY ENOUGH"
    },
    {
      "id": "english_000010",
      "path": "english/audio/000010.wav",
      "duration": 3.195,
      "language": "english",
      "source": "librispeech",
      "transcription": "I AM CONVINCED OF WHAT I SAY SAID THE COUNT"
    },
    {
      "id": "english_000011",
      "path": "english/audio/000011.wav",
      "duration": 2.94,
      "language": "english",
      "source": "librispeech",

$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python -c "
import json
from pathlib import Path

eval_dir = Path('data/evaluation')
output_lines = []

languages = ['telugu', 'hindi', 'english', 'tamil', 'kannada', 'malayalam',
             'assamese', 'odia', 'marathi', 'punjabi', 'gujarati', 'bengali']

for lang in languages:
    metadata_path = eval_dir / lang / 'metadata.json'
    if not metadata_path.exists():
        continue

    with open(metadata_path) as f:
        metadata = json.load(f)

    for sample in metadata.get('samples', []):
        entry = {
            'id': sample['id'],
            'audio_path': f'evaluation/{lang}_eval.tar::{sample[\"id\"]}.wav',  # R2 path
            'local_path': sample['path'],  # local relative path
            'language': sample['language'],
            'duration': sample['duration'],
            'transcription': sample.get('transcription', ''),
            'source': sample.get('source', 'unknown'),
        }
        output_lines.append(json.dumps(entry, ensure_ascii=False))

# Write JSONL
output_path = eval_dir / 'evaluation_transcripts.jsonl'
with open(output_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))

print(f'Created: {output_path}')
print(f'Total entries: {len(output_lines)}')
print(f'File size: {output_path.stat().st_size / 1024:.2f} KB')

# Show sample entries
print('\n--- Sample entries ---')
for lang in ['telugu', 'hindi', 'english']:
    for line in output_lines:
        entry = json.loads(line)
        if entry['language'] == lang:
            print(f'{entry[\"id\"]}: {entry[\"transcription\"][:60]}...')
            break
"
Created: data/evaluation/evaluation_transcripts.jsonl
Total entries: 6000
File size: 2932.68 KB

--- Sample entries ---
telugu_000000: చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో స...
hindi_000000: कुछ अणुओं में अस्थिर केंद्रक होता है जिसका मतलब यह है कि उनम...
english_000000: CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS...
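
The `audio_path` values written by the script above follow a `shard.tar::member` convention. A minimal Python sketch of how a consumer might split such a path and tally entries per language is given here. The helper names `split_audio_path` and `summarize_jsonl` are hypothetical, and the log does not show the member names inside each tar, so `<id>.wav` being the actual member name is an assumption carried over from the script.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Tuple


def split_audio_path(audio_path: str) -> Tuple[str, str]:
    """Split 'shard.tar::member.wav' into (shard_key, member_name)."""
    shard_key, sep, member = audio_path.partition("::")
    if not sep or not shard_key or not member:
        raise ValueError(f"not a 'shard::member' path: {audio_path!r}")
    return shard_key, member


def summarize_jsonl(lines: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Return {language: (sample_count, total_hours)} for JSONL lines."""
    counts: Counter = Counter()
    seconds: Dict[str, float] = {}
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines and a missing trailing newline
            continue
        entry = json.loads(line)
        lang = entry["language"]
        counts[lang] += 1
        seconds[lang] = seconds.get(lang, 0.0) + float(entry["duration"])
    return {lang: (counts[lang], seconds[lang] / 3600.0) for lang in counts}


if __name__ == "__main__":
    # Two entries copied from the log; transcriptions omitted for brevity.
    demo = [
        json.dumps({
            "id": "telugu_000000",
            "audio_path": "evaluation/telugu_eval.tar::telugu_000000.wav",
            "language": "telugu",
            "duration": 8.28,
        }),
        json.dumps({
            "id": "telugu_000001",
            "audio_path": "evaluation/telugu_eval.tar::telugu_000001.wav",
            "language": "telugu",
            "duration": 6.72,
        }),
    ]
    print(split_audio_path(json.loads(demo[0])["audio_path"]))
    print(summarize_jsonl(demo))
```

Run against the real `evaluation_transcripts.jsonl`, the per-language counts should come out at 500 each if the file matches the summary tables in this log; the hours may differ in the last digit because the tables show rounded values.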

$ cd /home/ubuntu/xcodec2_16khz_indic && source venv/bin/activate && python -c "
import boto3
from botocore.config import Config
from dotenv import load_dotenv
import os
from pathlib import Path

load_dotenv()

s3 = boto3.client(
    's3',
    endpoint_url=os.getenv('R2_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('R2_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('R2_SECRET_ACCESS_KEY'),
    config=Config(signature_version='s3v4', s3={'addressing_style': 'path'})
)

# Upload transcripts JSONL
jsonl_path = Path('data/evaluation/evaluation_transcripts.jsonl')
r2_key = 'evaluation/evaluation_transcripts.jsonl'

with open(jsonl_path, 'rb') as f:
    s3.upload_fileobj(f, 'xcodec', r2_key, ExtraArgs={'ContentType': 'application/jsonl'})

print(f'✅ Uploaded: xcodec/{r2_key}')
print(f'   Size: {jsonl_path.stat().st_size / 1024:.2f} KB')

# Also upload the full manifest again (with transcripts)
manifest_path = Path('data/evaluation/evaluation_manifest.json')
r2_key_manifest = 'evaluation/evaluation_manifest.json'

with open(manifest_path, 'rb') as f:
    s3.upload_fileobj(f, 'xcodec', r2_key_manifest, ExtraArgs={'ContentType': 'application/json'})

print(f'✅ Uploaded: xcodec/{r2_key_manifest}')

# List all files in evaluation/
print('\n📂 Updated files in xcodec/evaluation/:')
response = s3.list_objects_v2(Bucket='xcodec', Prefix='evaluation/')
for obj in response.get('Contents', []):
    size_mb = obj['Size'] / 1024 / 1024
    if size_mb < 1:
        size_str = f'{obj[\"Size\"] / 1024:.2f} KB'
    else:
        size_str = f'{size_mb:.2f} MB'
    print(f'  {obj[\"Key\"]:50} {size_str:>12}')
"
✅ Uploaded: xcodec/evaluation/evaluation_transcripts.jsonl
   Size: 2932.68 KB
✅ Uploaded: xcodec/evaluation/evaluation_manifest.json

📂 Updated files in xcodec/evaluation/:
  evaluation/assamese_eval.tar                          156.04 MB
  evaluation/bengali_eval.tar                           169.05 MB
  evaluation/english_eval.tar                           105.46 MB
  evaluation/evaluation_manifest.json                     2.85 MB
  evaluation/evaluation_transcripts.jsonl                 2.86 MB
  evaluation/gujarati_eval.tar                          149.36 MB
  evaluation/hindi_eval.tar                             150.39 MB
evaluation/index.json 1.87 KB evaluation/kannada_eval.tar 170.39 MB evaluation/malayalam_eval.tar 171.19 MB evaluation/marathi_eval.tar 167.05 MB evaluation/odia_eval.tar 159.12 MB evaluation/punjabi_eval.tar 149.30 MB evaluation/tamil_eval.tar 156.50 MB evaluation/telugu_eval.tar 152.80 MB $ cd /home/ubuntu/xcodec2_16khz_indic && head -5 data/evaluation/evaluation_transcripts.jsonl | python -m json.tool --no-ensure-ascii 2>/dev/null || head -5 data/evaluation/evaluation_transcripts.jsonl {"id": "telugu_000000", "audio_path": "evaluation/telugu_eval.tar::telugu_000000.wav", "local_path": "telugu/audio/000000.wav", "language": "telugu", "duration": 8.28, "transcription": "చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో సంబంధం కలిగి ఉన్నాయి ఇంకా వీటిని లగ్జరీ బీచ్ రిసార్ట్ స్ అని పిలుస్తారు", "source": "fleurs"} {"id": "telugu_000001", "audio_path": "evaluation/telugu_eval.tar::telugu_000001.wav", "local_path": "telugu/audio/000001.wav", "language": "telugu", "duration": 6.72, "transcription": "కొన్ని క్రియలు ఆబ్జెక్టుల మధ్య తేడాను గుర్తించడానికి ఇది ఒక ముఖ్యమైన మార్గం", "source": "fleurs"} {"id": "telugu_000002", "audio_path": "evaluation/telugu_eval.tar::telugu_000002.wav", "local_path": "telugu/audio/000002.wav", "language": "telugu", "duration": 10.08, "transcription": "మీరు మీ స్వంతం ఆలోచనలతో కాకుండా ప్రభుత్వాల సలహా కూడా తీసుకోవాలని అనుకోవచ్చు అయితే వారి సలహా వారి పౌరుల కోసం రూపొందించబడింది", "source": "fleurs"} {"id": "telugu_000003", "audio_path": "evaluation/telugu_eval.tar::telugu_000003.wav", "local_path": "telugu/audio/000003.wav", "language": "telugu", "duration": 5.94, "transcription": "ఈ దృశ్యాలు పిరమిడ్లపై ప్రదర్శించబడతాయి మరియు వేరే పిరమిడ్లను వెలిగించబడ్డాయి", "source": "fleurs"} {"id": "telugu_000004", "audio_path": "evaluation/telugu_eval.tar::telugu_000004.wav", "local_path": "telugu/audio/000004.wav", "language": "telugu", "duration": 10.62, "transcription": "పరికరాన్ని గాల్లో కదిలించడం ద్వారా వీడియో గేమ్స్లో యాక్షన్లు మరియు కదలికలను 
నియంత్రించడానికి ఇది ఆటగాళ్లకు వీలు కల్పిస్తుంది", "source": "fleurs"} $ cd /home/ubuntu/xcodec2_16khz_indic && echo "=== TRANSCRIPT SAMPLES PER LANGUAGE ===" && for lang in telugu hindi english tamil kannada malayalam assamese odia marathi punjabi gujarati bengali; do echo -e "\n--- $lang ---" grep "\"language\": \"$lang\"" data/evaluation/evaluation_transcripts.jsonl | head -1 | python3 -c "import sys,json; d=json.loads(sys.stdin.read()); print(f'ID: {d[\"id\"]}'); print(f'Duration: {d[\"duration\"]}s'); print(f'Transcript: {d[\"transcription\"][:80]}...')" done === TRANSCRIPT SAMPLES PER LANGUAGE === --- telugu --- ID: telugu_000000 Duration: 8.28s Transcript: చిన్న ద్వీపాలలో చాలా వరకు స్వతంత్ర దేశాలు లేదా ఫ్రాన్స్ తో సంబంధం కలిగి ఉన్నాయి ... --- hindi --- ID: hindi_000000 Duration: 9.12s Transcript: कुछ अणुओं में अस्थिर केंद्रक होता है जिसका मतलब यह है कि उनमें थोड़े या बिना किस... --- english --- ID: english_000000 Duration: 3.505s Transcript: CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS... --- tamil --- ID: tamil_000000 Duration: 7.8s Transcript: இது வேதியியல் ph என அழைக்கப்படுகிறது நீங்கள் சிவப்பு முட்டைக்கோஸ் சாற்றைப் பயன்ப... --- kannada --- ID: kannada_000000 Duration: 11.34s Transcript: ಆದರೆ ನಾಯಕನ ವಿಕೆಟ್ ಕಳೆದುಕೊಂಡ ನಂತರ ಭಾರತ 7 ವಿಕೆಟ್ ಕಳೆದುಕೊಂಡು ಕೇವಲ 36 ರನ್ಗಳಿಗೆ ತನ್ನ ... --- malayalam --- ID: malayalam_000000 Duration: 11.4s Transcript: ഈ നഗരം രാജ്യത്തെ മറ്റ് നഗരങ്ങളിൽ നിന്ന് പൂർണ്ണമായും വ്യത്യസ്തമാണ് കാരണം ഇതിന് ആഫ... --- assamese --- ID: assamese_000000 Duration: 9.48s Transcript: প্ৰতিবেদনটোত ইৰাকৰ কাৰ্যপালিকাৰ বৰ্তমানৰ প্ৰতিটো নীতিৰ অতি বেছি সমালোচনা কৰা হৈছ... --- odia --- ID: odia_000000 Duration: 10.86s Transcript: ଯେଉଁ ଯୁଗରେ ଘଟଣାଗୁଡ଼ିକ ଘଟିଥିଲା ​​ତାହାକୁ ସାଧାରଣତଃ 11ଶ 12ଶ ଏବଂ 13ଶ ଶତାବ୍ଦୀରେ ad 100... --- marathi --- ID: marathi_000000 Duration: 8.16s Transcript: पोलिस अधीक्षक चंद्र शेखर सोलंकी यांनी सांगितले की आरोपी चेहरा झाकून घेऊन कोर्टात... 

--- punjabi ---
ID: punjabi_000000
Duration: 7.14s
Transcript: ਪਿਰਾਮਿਡ ਦਾ ਸਾਊਂਡ ਅਤੇ ਲਾਈਟ ਸ਼ੋਅ ਇਸ ਖੇਤਰ ਅੰਦਰ ਬੱਚਿਆਂ ਲਈ ਕਰਨ ਵਾਲੀਆਂ ਸਭ ਤੋਂ ਮਜ਼ੇਦਾਰ ...

--- gujarati ---
ID: gujarati_000000
Duration: 11.16s
Transcript: છોડ ઓક્સિજન બનાવે છે જેનેથી મનુષ્ય શ્વાસ લે છે અને તેઓ કાર્બન-ડાયોક્સાઇડ લે છે જ...

--- bengali ---
ID: bengali_000000
Duration: 10.44s
Transcript: একজন শুধুমাত্র আশ্চর্য হতে পারে এই ভেবে যে যখন নতুন কিছু আসে তখন কী-বোর্ড কীরকম ...
$
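
The header at the top of this log records `last_exit_code: 141` for the `grep ... | head -1` loop above. 141 is 128 + 13 (SIGPIPE), which is consistent with `head` exiting before `grep` has finished writing, so the printed samples are most likely complete despite the non-zero status. A hedged alternative, sketched in Python with hypothetical helper names (`first_entry_per_language`, `format_entry`), reads the JSONL once and involves no pipe at all:

```python
import json
from pathlib import Path
from typing import Dict, Iterable


def first_entry_per_language(lines: Iterable[str]) -> Dict[str, dict]:
    """Return the first JSONL entry seen for each language, in file order."""
    first: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # setdefault keeps the earliest entry and ignores later ones.
        first.setdefault(entry["language"], entry)
    return first


def format_entry(entry: dict, width: int = 80) -> str:
    """Render one entry in the same three-line shape as the shell loop."""
    return (
        f"ID: {entry['id']}\n"
        f"Duration: {entry['duration']}s\n"
        f"Transcript: {entry['transcription'][:width]}..."
    )


if __name__ == "__main__":
    # Path taken from the log; the script does nothing if the file is absent.
    path = Path("data/evaluation/evaluation_transcripts.jsonl")
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for lang, entry in first_entry_per_language(f).items():
                print(f"\n--- {lang} ---")
                print(format_entry(entry))
```

Languages are printed in the order they first appear in the file, which matches the order of the shell loop only because the JSONL was written language by language.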