B, breathe for a second. You did **not** nuke the whole project.

You did **not** make a TTS-ending blunder. You made a **reproducibility and codec-side evaluation blunder**. Annoying, absolutely. Fatal, no. In an Inworld/Qwen-style SpeechLM pipeline, the **SpeechLM pretraining** is just next-token prediction over text/audio-token sequences, and **SpeechLM SFT** is masked autoregressive prediction of the output audio tokens. Inworld also explicitly says they reuse **pre-tokenized audio codes** across runs to reduce storage and I/O, which is exactly the kind of thing you did.  

What you **lost** by not keeping the original segments is this:

* you cannot do **codec training or codec fine-tuning** on that corpus anymore
* you cannot compute **intrusive waveform-reference metrics** like PESQ/STOI against the true original audio on the full corpus
* you cannot safely **change codec checkpoints** later without re-encoding

What you **did not lose** is this:

* SpeechLM pretraining
* SpeechLM SFT
* WER/CER evaluation on generated audio against target text
* MOS-style human eval
* DNSMOS/UTMOS-style non-intrusive eval
* speaker-sim eval if you still have a reference prompt audio for eval samples

So no, you are not fucked. You are just living through the consequences of “storage is expensive, future me can suffer.” A classic human pastime.

## First, the loss-function confusion

Separate these three things in your head:

1. **Codec losses**
   These belong to the **audio tokenizer / decoder** training stage, not the SpeechLM stage. Inworld’s codec training uses adversarial loss, feature matching, mel loss, plus an RMS loudness loss for volume consistency. Qwen’s tokenizer training also uses codec-side objectives like mel reconstruction, GAN losses, WavLM teacher guidance, and RVQ objectives. Those all need original waveforms.  

2. **SpeechLM PT / SFT losses**
   These are basically **cross-entropy over discrete audio tokens**. Inworld pretraining uses standard next-token prediction, and SFT uses autoregressive NLL over the target audio-token sequence conditioned on the prompt. Qwen’s report also frames the SpeechLM part as pretraining followed by post-training, not waveform reconstruction.  

3. **Evaluation / post-training rewards**
   WER, speaker similarity, DNSMOS, MOS, controllability scores. These are **not** your base PT/SFT loss. Inworld only brings WER, speaker similarity, and DNSMOS in later during **GRPO-based RL alignment**, after PT and SFT. Qwen does post-training with DPO/GSPO and speaker fine-tuning after the main pretraining stages.  

So your 5AM chicken-brain conclusion was actually mostly right: **WER is not a base pretraining loss**. Don’t let that panic steer the whole run.

---

## 1. Data mixture, what I’d actually do

### My blunt recommendation

For the **real run**, I would not burn the main budget on 0.6B. Use a **1.5B to 1.7B class model** if you can afford it. Qwen’s 1.7B variants beat the 0.6B ones on zero-shot generation, and Inworld’s larger model also beats the smaller one on quality metrics. Use 0.6B or 1B only as a **pipeline smoke test**.  

### Backbone choice

* **If you need lowest engineering risk today:** stick to the Inworld repo path and use the LLaMA-based backbone it already supports.
* **If you want the better multilingual bet for the real run:** move to a Qwen-ish 1.5B to 1.7B backbone.

Given your corpus is massively multilingual and Indic-heavy, the second option is the one I’d spend real compute on.

### Pretraining mixture

Your idea of including **10% of SFT into PT** is sane. Inworld mixed about **15k hours of high-quality SFT data** into pretraining to bootstrap alignment, which is the same general move. They also mixed in about **10% text-only data** during PT to preserve text understanding. 

I’d start with this:

**PT Stage S1, broad foundation**

* 80% speech-only PT corpus
* 10% high-quality paired TTS subset from SFT
* 10% text-only LM data

**PT Stage S2, quality continual pretraining**

* 60% cleaner speech-only PT subset
* 35% high-quality paired TTS
* 5% text-only

**PT Stage S3, long-context**

* only if you can stitch adjacent segments from the **same source / same speaker / same timeline**
* otherwise skip it for the first serious run

Use compute split:

* **70%** of PT compute on S1
* **20%** on S2
* **10%** on S3

That structure is much closer to what Qwen reports conceptually: broad multilingual pretraining, then high-quality continual PT, then a dedicated long-context stage. 
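The stage mixtures above are just sampling weights over data sources. A minimal sketch of picking the source for each packed sequence, with the S1/S2 ratios from the plan (source names here are placeholders, not anything from the repos):

```python
import random

# Assumed stage mixtures from the plan above; source names are placeholders.
STAGE_MIXTURES = {
    "S1": {"speech_pt": 0.80, "paired_tts": 0.10, "text_only": 0.10},
    "S2": {"speech_pt": 0.60, "paired_tts": 0.35, "text_only": 0.05},
}

def sample_source(stage: str, rng: random.Random) -> str:
    """Pick which data source the next packed sequence comes from."""
    mix = STAGE_MIXTURES[stage]
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Sanity check: over many draws, S1 should land near 80/10/10.
rng = random.Random(0)
counts = {s: 0 for s in STAGE_MIXTURES["S1"]}
for _ in range(10_000):
    counts[sample_source("S1", rng)] += 1
```

In a real loader you would sample per packed sequence, not per batch, so the mixture holds even with variable-length packing.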

### Text-only data, yes or no?

Yes, but **only 5% to 10% of PT**, not some heroic 30% side quest.

For the quickest sources:

* **FineWeb** is the easiest clean English web-text add-on. The official dataset card describes it as a filtered, deduplicated English CommonCrawl-derived pretraining corpus of about **15T tokens**. ([huggingface.co][1])
* **AI4Bharat Sangraha** is a strong, practical Indic text source. AI4Bharat says it has built a **251B-token pretraining corpus across 22 Indian languages**, and the Sangraha dataset card shows a large multilingual Indic text dataset in parquet format. ([ai4bharat.iitm.ac.in][2])
* **OIG** is a quick instruction-style text source if you want a small instruction-only slice in PT. LAION describes it as an open instruction dataset with about **43M instructions**. ([laion.ai][3])
* **RedPajama-V2** is useful if you want quality signals and more control, but Together explicitly says it is **not intended to be used out of the box** and should be filtered for your use case. So it is powerful, but it is not the lazy path. ([together.ai][4])

If you want the fastest sane setup, do:

* FineWeb for English text-only
* Sangraha for Indic text-only
* a **small** OIG slice for instruction formatting

### SFT mixture

For **base SFT**, I would do this:

* **Native-script transcripts as the main path**
* **No text-only SFT mixed in**
* **No audio-tagged experimental data in the base run**
* **No heavy romanized/code-mixed mixing in the base run**

Inworld explicitly reports that mixing text-based instruction-following data during SFT degraded synthesis quality and even made speech generation unreliable, even though the audio-side training loss was not obviously screaming at them. That is exactly why you should not blindly trust a mixed-objective SFT just because the repo happens to support it. 

Also, do **not** let HiFiTTS dominate SFT. It is clean and useful, but if you let it become a giant share of training, your model starts smelling too much like studio speech and not enough like the messy real world you clearly care about.

I would cap effective SFT sampling roughly like this:

* cleaned in-the-wild paired data as the majority
* HiFiTTS at **10% to 15%** of SFT steps max
* explicit language balancing with temperature sampling

For language balancing:

* PT: sample languages with roughly `p(lang) ∝ n_lang^0.7`
* SFT: more aggressive rebalance, roughly `p(lang) ∝ n_lang^0.5`

That prevents English from flattening the rest of the map like a bored empire.
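The temperature rebalance is one function; the hour counts below are made-up numbers just to show the effect:

```python
def lang_sampling_probs(hours_per_lang: dict, alpha: float) -> dict:
    """Temperature-style rebalance: p(lang) ∝ n_lang^alpha.
    alpha < 1 upweights low-resource languages; alpha = 1 is natural sampling."""
    weights = {lang: n ** alpha for lang, n in hours_per_lang.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical corpus sizes in hours, just to illustrate.
hours = {"en": 40_000, "hi": 8_000, "ta": 2_000}
pt_probs = lang_sampling_probs(hours, 0.7)   # PT: gentler rebalance
sft_probs = lang_sampling_probs(hours, 0.5)  # SFT: more aggressive
```

With these toy numbers, English drops from an 80% natural share to roughly 69% at α=0.7 and roughly 60% at α=0.5.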

---

## 2. Prompting / data format

Do **not** create 11 different magical prompt religions.

Use **one outer template family**, with only **3 task types** inside it.

### Task A: speech-only PT

For raw PT audio with no transcript, do **speech-only LM**, not fake TTS.

```text
<bos> <speech_start> [audio tokens] <speech_end>
```

If you want continuation training:

```text
<bos> <speech_start> [prompt audio tokens] <speech_sep> [target audio tokens] <speech_end>
```

### Task B: paired TTS / cloning-style PT+SFT

If your final inference setup is in-context voice cloning, train that format directly.

```text
<bos> <ref_text> ... </ref_text>
<ref_speech> [ref audio tokens] </ref_speech>
<target_text> ... </target_text>
<speech_start> [target audio tokens] <speech_end>
```

### Task C: text-only PT

Keep a normal text-only LM format for PT only.

```text
<bos> <user> ... </user> <assistant> ... </assistant>
```

That is the clean version of what both camps are doing:

* Inworld mixes raw audio and text-only PT, with speech boundaries marked explicitly. 
* Qwen standardizes everything into ChatML-style formatting for controllable training. 

### Important prompt rule

Do **not** keep changing the jinja for every dataset flavor.
The jinja should be dumb and deterministic. The intelligence belongs in the **sample type** and **preprocessing**, not in six layers of template spaghetti.
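Concretely, "dumb and deterministic" means one renderer keyed only on the sample's task type. A sketch using the template formats from Tasks A–C above (the token strings are the placeholders from those sketches, not real special tokens from any tokenizer):

```python
def render(sample: dict) -> str:
    """One deterministic renderer keyed only on sample['task'].
    Token strings follow the sketch formats above; real tokens depend on your tokenizer."""
    t = sample["task"]
    if t == "speech_only":
        return f"<bos> <speech_start> {sample['audio']} <speech_end>"
    if t == "tts_clone":
        return (
            f"<bos> <ref_text> {sample['ref_text']} </ref_text> "
            f"<ref_speech> {sample['ref_audio']} </ref_speech> "
            f"<target_text> {sample['text']} </target_text> "
            f"<speech_start> {sample['audio']} <speech_end>"
        )
    if t == "text_only":
        return f"<bos> <user> {sample['user']} </user> <assistant> {sample['assistant']} </assistant>"
    raise ValueError(f"unknown task type: {t}")
```

Any dataset-specific logic (normalization, transcript variant selection, reference-audio pairing) happens before this function, in preprocessing, so every flavor of data arrives as one of exactly three task types.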

### Your 4 transcript variants

Use them like this:

* **Native**: main base SFT
* **Romanized**: evaluation normalization / fallback later
* **Code-mixed**: later augmentation if product inputs are actually code-mixed
* **Audio-tagged**: later control finetune, not base SFT

That gives you a sane curriculum instead of a transcript blender.

---

## 3. Exact hyperparameter plan I’d start with

These are the numbers I’d actually start from, not vague motivational poster garbage.

### Model

* **Real run:** 1.5B to 1.7B backbone
* **Smoke test:** 0.6B to 1B for 5k to 10k steps

### Shared defaults

* precision: **bf16**
* optimizer: **AdamW**
* betas: **β1 = 0.9, β2 = 0.95**
* weight decay: **0.1**
* grad clip: **1.0**
* FlashAttention2: **yes**
* fused AdamW: **yes**
* vocab expansion: add audio tokens + special tokens, initialize new embeddings from the original embedding statistics rather than dumb tiny random init, which is exactly the kind of initialization Inworld describes. 
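The embedding-init point deserves a sketch, because tiny random init for tens of thousands of new audio tokens is a common self-inflicted wound. A numpy mock-up of the idea (the vocab sizes are hypothetical; a real implementation would operate on the model's embedding and LM-head matrices in torch):

```python
import numpy as np

def expand_embeddings(emb: np.ndarray, n_new: int, rng: np.random.Generator) -> np.ndarray:
    """Append n_new rows drawn from the per-dimension mean/std of the existing
    embedding rows, instead of a tiny random init that puts new tokens far off
    the learned manifold."""
    mu = emb.mean(axis=0)
    sigma = emb.std(axis=0)
    new_rows = rng.normal(loc=mu, scale=sigma, size=(n_new, emb.shape[1]))
    return np.concatenate([emb, new_rows.astype(emb.dtype)], axis=0)

rng = np.random.default_rng(0)
old = rng.normal(size=(32_000, 64)).astype(np.float32)   # hypothetical text vocab
expanded = expand_embeddings(old, n_new=65_536 + 8, rng=rng)  # audio codes + specials
```

Do the same for the output head if it is untied, and remember the new rows still need the PT warmup to become meaningful.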

### PT, stage S1

* max seq len: **2048**
* global batch target: about **0.5M tokens per optimizer step**
* peak LR:

  * **1.5e-4** for 0.6B to 1B
  * **1.2e-4 to 1.5e-4** for 1.5B to 1.7B; I would start at **1.2e-4** if you want the safer first launch
* warmup: **10%**
* scheduler: **cosine decay**
* end LR for S1: **3e-5 to 4e-5**
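The warmup-then-cosine shape is simple enough to write out exactly, which also makes the S2/S3 "continue from a lower LR" logic unambiguous. A sketch using the S1 numbers above (total step count is hypothetical):

```python
import math

def lr_at(step: int, total: int, peak: float, end: float, warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak over warmup_frac of training, then cosine decay to end."""
    warmup = int(total * warmup_frac)
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return end + 0.5 * (peak - end) * (1.0 + math.cos(math.pi * progress))

# S1 numbers from above: peak 1.2e-4, end 3e-5, over a hypothetical 100k steps.
lr_peak = lr_at(10_000, 100_000, 1.2e-4, 3e-5)   # end of warmup → peak
lr_final = lr_at(100_000, 100_000, 1.2e-4, 3e-5)  # last step → end LR
```

S2 then continues from the S1 checkpoint with its own, lower `peak`/`end` pair rather than re-warming to 1.2e-4.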

### PT, stage S2

* continue from S1 checkpoint
* start LR: **3e-5**
* end LR: **1.5e-5**
* same optimizer settings
* same seq len: **2048**
* cleaner data only

### PT, stage S3

Only do this if you can preserve real adjacency, same speaker/source ordering.

* if you can stitch valid neighboring segments: raise context to **4096**
* if you cannot, skip S3 for now

* start LR: **1.5e-5**
* end LR: **7e-6**

I would **not** jump to 32768 context in your first XCodec2 50 Hz run. Qwen could go from 8192 to 32768 in its own tokenizer regime, but your setup is single-codebook 50 Hz and already expensive enough without setting fire to memory because ambition felt poetic at sunrise. 

### SFT

* max seq len: **2048**
* global batch target: about **0.25M tokens per optimizer step**
* peak LR: **1.5e-5**
* warmup: **5%**
* scheduler: **cosine**
* end LR: **5e-6**
* label masking: everything before `<speech_start>` should be **-100**
* loss on output speech only

That also lines up with Inworld’s SFT recipe directionally, including the fact that the SFT LR is much smaller and that starting SFT near the final PT LR matters. 
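The label-masking rule is worth writing down exactly, since an off-by-one here silently trains on the prompt. A minimal sketch with toy token ids (-100 is the conventional ignore index in common frameworks):

```python
IGNORE_INDEX = -100  # conventional cross-entropy ignore index in most frameworks

def mask_prompt_labels(token_ids: list, speech_start_id: int) -> list:
    """Copy input ids to labels, masking everything before <speech_start> so the
    loss is computed only on the output speech region."""
    idx = token_ids.index(speech_start_id)
    return [IGNORE_INDEX] * idx + list(token_ids[idx:])

# Toy ids: 1=<bos>, 5/6=prompt tokens, 7=<speech_start>, 100/101=audio codes, 8=<speech_end>
labels = mask_prompt_labels([1, 5, 6, 7, 100, 101, 8], speech_start_id=7)
# → [-100, -100, -100, 7, 100, 101, 8]
```

Verify the mask on a few real samples before launch; it is one of the cheapest bugs to catch early and one of the most expensive to catch late.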

### Stage-wise LR bump or drop?

**Drop, not bump.**

The optimizer is not secretly a wise old monk. It does not infer curriculum for you.
When data quality gets cleaner, **lower** LR and continue.

---

## 4. Loss functions, what you actually need

### Base pretraining

Use:

* **standard causal LM cross-entropy**

That’s it.

### Base SFT

Use:

* **same cross-entropy**
* **mask prompt tokens**
* only learn the target speech region

That’s it.

### Do you need WER in PT or SFT?

No.

WER is:

* an **evaluation metric**
* or a **reward** in later RL/post-training

It is **not** your core PT/SFT loss, and it is not differentiable in any standard practical way for this setup. Inworld only uses WER later as part of a GRPO reward, together with speaker similarity and DNSMOS. 

### Do you need original audio for PT/SFT losses?

No, because your training target is the **audio token sequence**, not the waveform.

### When do you need original audio?

You need original audio if you want:

* codec retraining
* mel / adversarial / waveform reconstruction losses
* intrusive reconstruction metrics against truth audio

### Can you decode predicted tokens and compare against decoded target tokens?

Only as a **proxy eval**, not as your main training objective.

Use that proxy for:

* rough speaker-sim sanity
* rough spectral sanity
* qualitative listening

Do **not** use decoded-target-vs-decoded-prediction as the headline metric for model quality. That mostly tells you how well your LM matches the codec’s own bottlenecked view of the world.

### What to do about Indic WER

For the **first real run**, do **not** block training on Indic ASR quality.

Do this instead:

* use **WER/CER only on languages where your ASR is trustworthy**
* for other Indic languages, use **normalized CER/WER after transliteration / romanization**
* keep the **romanized transcript variant** as an eval scaffold, even if you do not train on it

That trick is actually useful. It turns your romanized data from “maybe later” into “nice, at least you’re good for something.”
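Once both reference and hypothesis are pushed through the same normalization/transliteration (the transliteration step itself would come from whatever Indic tooling you trust; that part is assumed here), CER is just edit distance over characters:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate = edit_distance(ref, hyp) / len(ref).
    Run both strings through the same normalization/transliteration first."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))            # DP row: distance from empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / max(1, m)
```

Operating on characters rather than words is what makes this usable across scripts where ASR word segmentation is shaky.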

### Why is there text SFT in the repo?

Because repos are often more general than the best training recipe.
For your first serious run, I would **disable mixed text SFT**. Inworld’s own report says mixing text instruction-following during audio SFT hurt synthesis quality. 

---

## 5. Metrics, curves, checkpoints

If you run this blind with only total loss, you are begging the universe to humble you.

### Track inside training

Track at minimum:

* total loss
* **audio-token CE**
* **text-token CE** if text-only PT exists
* loss by **task type**
* loss by **language**
* loss by **dataset source**
* LR
* grad norm
* tokens/sec/GPU
* pad fraction / packing efficiency
* invalid token rate
* EOS rate
* repeated-token / loop rate in short sampled generations

### Checkpoint cadence

For PT:

* save every **2k steps** for the first **20k**
* then every **5k**

For SFT:

* every **1k** early
* then every **2k**

`10k` as the very first cadence is too coarse. By the time you discover a bad run, the cluster has already eaten a month of your life.

### External eval cadence

Since you want a separate machine, good. That’s the correct kind of laziness.

Do:

* **small canary eval** every checkpoint
* **full eval** every 5k PT steps
* **full eval** every 1k to 2k SFT steps

### External eval metrics

For the async evaluator:

* English WER
* Hindi + 2 to 4 other reliable Indic CER/WER tracks
* speaker similarity
* DNSMOS or UTMOS
* hallucination / artifact rate
* long-form subset
* number reading subset
* punctuation / abbreviations subset

Keep decoding params **fixed** during model selection. Do not change temperature every other day and pretend the curve means something.

Also, Inworld’s own report is a useful warning here: they found mixed-objective SFT could degrade synthesis quality even when training loss was not clearly flagging disaster. So yes, external generation-based eval is mandatory. 

---

## 6. Monitoring, infra, and data handling

With 80M segments, please do not let the dataloader open millions of tiny objects from R2 like a caffeinated raccoon.

### Use one of these

* **WebDataset** with tar shards
* **MosaicML Streaming / MDS** if you want object-store streaming + local cache
* local NVMe cache in front of R2

For your scale, I’d prefer:

* shard size around **1 GB to 4 GB**
* per-shard manifest
* local SSD cache
* async prefetch
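The shard format itself is nothing exotic: one tar per shard, files named `<sample_id>.<ext>`, which is the WebDataset convention. A real pipeline would use `webdataset`'s writer classes; here is a stdlib-only sketch of the same idea:

```python
import io
import json
import os
import tarfile
import tempfile

def write_shard(path: str, samples: list) -> None:
    """Pack samples into one tar shard, WebDataset-style: files named <sample_id>.<ext>."""
    with tarfile.open(path, "w") as tar:
        for s in samples:
            payload = json.dumps(s).encode("utf-8")
            info = tarfile.TarInfo(name=f"{s['sample_id']}.codes.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

shard_path = os.path.join(tempfile.gettempdir(), "shard-000000.tar")
write_shard(shard_path, [
    {"sample_id": "vid123_0007", "language": "hi", "codes": [17, 402, 99]},
])
```

Keep each shard in the 1 GB to 4 GB range as above, write a manifest per shard, and let the loader stream tars sequentially instead of issuing one R2 GET per segment.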

### Manifest fields you absolutely need

From now on, never store “just codes”.

Store:

* `sample_id`
* `source_video_id`
* `segment_order`
* `start_time`, `end_time`
* `speaker_id` or speaker-cluster id
* `language`
* `transcript_native`
* `transcript_romanized`
* `dataset_source`
* `code_len`
* `codec_ckpt_hash`
* `tokenizer_version`
* `normalization_version`
* checksum for the code array

Your codes are now the dataset schema. Treat the codec hash like a database migration, not a casual footnote.
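The field list above is basically a schema, so pin it down as one. A sketch as a dataclass plus a checksum helper (field names mirror the list; nothing here is from any particular repo):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ManifestRow:
    sample_id: str
    source_video_id: str
    segment_order: int
    start_time: float
    end_time: float
    speaker_id: str
    language: str
    transcript_native: str
    transcript_romanized: str
    dataset_source: str
    code_len: int
    codec_ckpt_hash: str
    tokenizer_version: str
    normalization_version: str
    codes_sha256: str

def codes_checksum(codes: list) -> str:
    """Checksum the code array so silent corruption or codec drift is detectable."""
    return hashlib.sha256(json.dumps(codes).encode("utf-8")).hexdigest()
```

A dataclass (or the equivalent parquet/JSONL schema) means a changed `codec_ckpt_hash` is a loud, filterable field instead of a surprise three weeks into PT.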

### Logging stack

Use:

* **W&B** or **TensorBoard** for training curves
* **Prometheus + Grafana** for system telemetry
* GPU health / utilization exporter
* async checkpoint uploader to **R2**
* only push **milestone checkpoints** to Hugging Face if storage/egress matters

### Throughput sanity

Inworld reports around **46k tokens/sec/GPU** in PT for the smaller model and around **18k tokens/sec/GPU** in SFT on 32 H100s, with PT benefiting from `torch.compile()` and SFT not benefiting as much due to variable sequence lengths. Do not treat those as gospel because your stack differs, but if you are absurdly below that kind of order of magnitude on comparable hardware, something is sick. 

---

## 7. What I would launch right now

This is the part that matters.

### Phase 0, one-day smoke test

* use the repo path with the least engineering friction
* 0.6B or 1B model
* 5k to 10k steps
* PT mixture: **80/10/10**
* seq len **2048**
* verify:

  * loss drops
  * no NaNs
  * valid audio generations
  * no special-token collapse
  * external evaluator works

### Phase 1, real PT

* switch to **1.5B to 1.7B**
* freeze the current codec completely
* run **S1 broad PT**
* then **S2 clean CPT**
* only run **S3 long-context** if you have real adjacency metadata

### Phase 2, real SFT

* native-script only
* cloning/in-context format as main task
* no text-only SFT
* no audio-tagged control data yet
* no RL yet

### Phase 3, only after base model is good

Then, and only then:

* control LoRA / markup finetune
* code-mixed / romanized augmentation
* RL alignment on a small high-quality subset with usable ASR

---

## Final verdict on the panic question

**No, you are not fucked.**

For your **SpeechLM PT and SFT**, stored codec tokens are enough.
You do **not** need original segments for the base losses.
You do **not** need WER optimization to start training.
You **should not** block this run on Indic ASR quality.
You **should** freeze the codec, stage the data, keep text-only in PT not SFT, and automate async eval.

The real mistake would be letting this panic stop the run entirely. That would be a much more expensive form of theater.

[1]: https://huggingface.co/datasets/HuggingFaceFW/fineweb "HuggingFaceFW/fineweb · Datasets at Hugging Face"
[2]: https://ai4bharat.iitm.ac.in/areas/llm "AI4Bharat"
[3]: https://laion.ai/blog/oig-dataset/ "The OIG Dataset | LAION"
[4]: https://www.together.ai/blog/redpajama-v2-faq "FAQ: Building LLMs with RedPajama-v2, a 30 trillion token web dataset"
