Now the question is: what to do about the 3-codebook final-export data? The TTS model expects 50 tok/s.

We have two options:
1. Use only josh/hifitts/indicvoices/etc (964 shards, ~38.7k hours, 50 tok/s) — correct format
2. Deinterleave final-export (take every 3rd token for codebook 0) — needs validation

Let me check what the non-final-export datasets give us:

● Bash(python3 -c "
      import psycopg2, os…)
  ⎿  === 50 tok/s datasets (non-final-export) ===
     Dataset   Lang   Shards   Segments   Hours
     ears      en     2        17,224     100.0
     … +26 lines (ctrl+o to expand)

● Here's the situation:

  ┌────────────────────────────────────────────┬────────┬──────────┬─────────┬───────────────────────┐
  │ Dataset Group                              │ Shards │ Segments │ Hours   │ Token Rate            │
  ├────────────────────────────────────────────┼────────┼──────────┼─────────┼───────────────────────┤
  │ 50 tok/s (josh, hifitts, indicvoices, etc) │ 964    │ 14.0M    │ 38,727  │ Correct for TTS model │
  ├────────────────────────────────────────────┼────────┼──────────┼─────────┼───────────────────────┤
  │ 150 tok/s (final-export, 3 codebooks)      │ 4,350  │ 60.7M    │ 109,160 │ Wrong format          │
  └────────────────────────────────────────────┴────────┴──────────┴─────────┴───────────────────────┘

  The final-export is 3x the data but uses 3 interleaved codebooks. The TTS model expects single-codebook 50 tok/s.

  Two paths forward:
  1. Safe: Use only the 964 correct shards (14M segments, 38.7k hours) — still a massive dataset
  2. Bold: Deinterleave final-export to extract codebook 0 only → 50 tok/s, but needs validation that codebook 0 alone produces good audio

  Which approach do you want? Or should I test deinterleaving first?

✻ Cogitated for 6m 7s

❯ i suppose it is not recoverable now are they ?

● Right — the 3-codebook encoding is a fundamentally different quantization from single-codebook. You can't extract one codebook and get the same result as encoding with 1 codebook from scratch.
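For reference, the "deinterleave" option above is mechanically just a strided slice over the interleaved token stream. A minimal sketch — the function name and the flat `[cb0_t0, cb1_t0, cb2_t0, cb0_t1, …]` layout are assumptions about the shard format, not confirmed; and, as noted, the result is not equivalent to native single-codebook encoding:

```python
def deinterleave(tokens, n_codebooks=3, codebook=0):
    """Extract one codebook's tokens from a flat interleaved stream.

    Assumes all codebooks for frame t sit adjacent in the stream, so a
    3-codebook 150 tok/s stream yields 50 tok/s for any single codebook.
    Caveat: this is NOT the same as encoding with 1 codebook from scratch.
    """
    return tokens[codebook::n_codebooks]

# 3 frames x 3 codebooks = 9 interleaved tokens
stream = [10, 20, 30, 11, 21, 31, 12, 22, 32]
assert deinterleave(stream) == [10, 11, 12]               # codebook 0
assert deinterleave(stream, codebook=2) == [30, 31, 32]   # codebook 2
```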
  The original audio is still in R2 (audio.tar per shard), but re-encoding 109k hours would need the GPU fleet again.

  What we have right now that's usable:
  - 964 shards, 14M segments, 38,727 hours at the correct 50 tok/s single-codebook neucodec
  - All 12 languages represented (English heavy via hifitts2, Indic via josh/joshdelivery/indicvoices)

  Let me run the pipeline on just the correct data — with the .flac fix applied:

● Bash(python3 -c "
      import psycopg2, os…)
  ⎿  Lang   Segments     Hours
     en     10,744,197   29,362
     gu     666,228      1,824
     … +10 lines (ctrl+o to expand)

● So the usable 50 tok/s data is heavily English (29k hrs), with each Indic language contributing between 42 and 1,824 hours. Still 38.7k hours total — a solid dataset.

  Want me to proceed with this, or do you want to explore re-encoding the final-export data first?

✻ Brewed for 33s

❯ provide the situation and give context, i'll take it to the agent who worked on encoding with horizontal scaling.

● Situation Brief for the Encoding Team

  What happened

  The neucodec encoding pipeline encoded 5,314 shards across 12 languages.
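The per-language totals shown in that output are a grouped rollup of segment counts and durations. A pure-Python sketch of the same aggregation (the row shape `(lang, n_segments, seconds)` is an assumption standing in for whatever the actual tracking-table query returns):

```python
from collections import defaultdict

def per_language_hours(rows):
    """Aggregate (lang, n_segments, seconds) rows into per-language totals.

    Returns {lang: (total_segments, total_hours)} with hours rounded
    to one decimal, mirroring the Lang/Segments/Hours output above.
    """
    segs, secs = defaultdict(int), defaultdict(float)
    for lang, n_segments, seconds in rows:
        segs[lang] += n_segments
        secs[lang] += seconds
    return {lang: (segs[lang], round(secs[lang] / 3600, 1)) for lang in segs}

rows = [("en", 1000, 7200.0), ("gu", 500, 3600.0), ("en", 200, 1800.0)]
assert per_language_hours(rows) == {"en": (1200, 2.5), "gu": (500, 1.0)}
```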
  However, two different codec configurations were used:

  ┌──────────────────────────────────────────────────────────┬────────┬──────────┬─────────┬────────────┬───────────────┬────────────────┐
  │ Dataset Group                                            │ Shards │ Segments │ Hours   │ Token Rate │ Codebooks     │ Status         │
  ├──────────────────────────────────────────────────────────┼────────┼──────────┼─────────┼────────────┼───────────────┼────────────────┤
  │ final-export                                             │ 4,350  │ 60.7M    │ 109,160 │ 150 tok/s  │ 3 interleaved │ Wrong format   │
  ├──────────────────────────────────────────────────────────┼────────┼──────────┼─────────┼────────────┼───────────────┼────────────────┤
  │ josh, joshdelivery, hifitts2, indicvoices, indicvoices-r,│        │          │         │            │               │                │
  │ globe, ears, vctk, librittsr, ljspeech, expresso         │ 964    │ 14.0M    │ 38,727  │ 50 tok/s   │ 1 (correct)   │ Ready to train │
  └──────────────────────────────────────────────────────────┴────────┴──────────┴─────────┴────────────┴───────────────┴────────────────┘

  The problem

  The TTS model (Scicom-intl/Multilingual-TTS-1.7B-Base) expects single-codebook neucodec at 50 tokens/sec. The final-export dataset (78% of all data, 109k hours) was encoded at 150 tok/s with 3 interleaved codebooks.
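A quick way to tell which configuration a given shard was produced with is to compare token count against audio duration. A heuristic sketch (assumes each codebook contributes exactly 50 tokens per second of audio, with a small tolerance for framing error — thresholds are assumptions, not part of the pipeline):

```python
def infer_format(n_tokens, duration_s, base_rate=50, tol=0.1):
    """Classify a segment's encoding by its effective token rate.

    Returns (token_rate, n_codebooks), assuming each codebook emits
    base_rate tokens per second: 50 tok/s -> 1 codebook (correct),
    150 tok/s -> 3 interleaved codebooks (final-export).
    """
    rate = n_tokens / duration_s
    n_codebooks = round(rate / base_rate)
    if n_codebooks < 1 or abs(rate - n_codebooks * base_rate) > tol * base_rate:
        raise ValueError(f"unexpected token rate {rate:.1f} tok/s")
    return n_codebooks * base_rate, n_codebooks

assert infer_format(150, 3.0) == (50, 1)    # correct single-codebook segment
assert infer_format(450, 3.0) == (150, 3)   # final-export style
```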
  These tokens cannot be used directly — the model's vocab assumes single-codebook sequences.

  What re-encoding would take

  Re-encode final-export with single-codebook neucodec (50 tok/s):
  - Source audio: s3://finalsftdata//audio.tar (still in R2)
  - Target: neucodec_tokens.parquet at 50 tok/s, matching the format of josh/hifitts shards
  - Scope: 60.7M segments, 109,160 hours across 12 languages
  - Docker image bharathkumar192/neucodec-worker:latest is still on Docker Hub
  - Supabase tracking tables are still active — could reset status for the final-export shards and re-run
  - fleet.py launch N can spin up a new Vast.ai fleet

  Language breakdown of what needs re-encoding

  en: 1,231 shards   17.4M segs   32,301 hrs
  hi:   698 shards    9.9M segs   16,890 hrs
  te:   585 shards    8.2M segs   13,871 hrs
  ml:   372 shards    5.2M segs    9,554 hrs
  pa:   343 shards    4.8M segs    8,796 hrs
  ta:   312 shards    4.2M segs    7,997 hrs
  kn:   200 shards    2.8M segs    5,243 hrs
  gu:   186 shards    2.6M segs    4,488 hrs
  bn:   158 shards    2.2M segs    3,721 hrs
  mr:   148 shards    2.0M segs    3,697 hrs
  or:    78 shards    946K segs    1,800 hrs
  as:    39 shards    418K segs      804 hrs

  What we CAN train on right now

  964 shards with the correct 50 tok/s encoding — 14M segments, 38,727 hours. The pipeline is built, smoke-tested, and ready to go.

  Key config to verify with the encoding team

  The worker.py neucodec model config — which parameter controls 1 vs 3 codebooks? The original encoding used the correct config for josh/hifitts/indicvoices but a different config for final-export. We need to ensure re-encoding uses the single-codebook FSQ configuration matching neuphonic/neucodec.
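If the team does re-run, the "reset status and re-run" step above would be a single parameterized UPDATE against the tracking tables. A hedged sketch — the table and column names (`shards`, `dataset`, `status`, `worker_id`) are assumptions and must be matched to the actual Supabase schema before anything is executed:

```python
def reset_shards_sql(dataset="final-export"):
    """Build a parameterized UPDATE that re-queues shards for re-encoding.

    Hypothetical schema: shards(dataset, status, worker_id). Returns the
    (sql, params) pair you would pass to a psycopg2 cursor.execute().
    """
    sql = (
        "UPDATE shards SET status = 'pending', worker_id = NULL "
        "WHERE dataset = %s AND status = 'done'"
    )
    return sql, (dataset,)

sql, params = reset_shards_sql()
assert params == ("final-export",)
assert sql.startswith("UPDATE shards")
```

Using a parameterized query (rather than interpolating the dataset name into the string) keeps the reset safe to reuse for other dataset groups.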