You’re not fucked. You made future evaluation more annoying, but you did **not** destroy the core SpeechLM training run.

For a codec-token TTS stack, the part that truly needs original waveforms is the **codec/vocoder** training stage. Qwen’s tokenizer training and Inworld’s decoder training use waveform-dependent objectives like adversarial, mel/reconstruction-style, feature-matching, and loudness losses. But once the codec is frozen, the **SpeechLM** part is much simpler: Inworld does SpeechLM pretraining as standard next-token prediction, and SFT as autoregressive negative log-likelihood over target speech tokens conditioned on text. Qwen3-TTS also separates tokenizer training from SpeechLM pretraining and post-training. ([ar5iv][1]) 

The missing waveforms mostly remove **exact source-domain reconstruction checks** on that PT corpus. They do **not** block PT itself. Inworld explicitly stores indexes alongside pretokenized audio codes and reports roughly **500:1 compression versus raw audio**, reusing the same codes across many runs. Treat your saved codes as the canonical PT dataset now. That is normal, not some apocalyptic crime scene. ([arXiv][2])

Your 5AM WER panic can also sit down. For TTS, WER is computed by decoding generated tokens to audio, transcribing **that generated audio** with ASR, and comparing the transcript to the **intended target text**. Inworld’s reward pipeline does exactly that. Speaker similarity compares generated audio to the reference prompt audio, and DNSMOS scores the generated audio itself. So **original target waveforms are not required for WER**. ([arXiv][2])

What you **cannot** do anymore on that PT corpus: run exact codec reconstruction eval against the original waveforms, re-run source-audio filtering like DNSMOS/CPS unless you already saved that metadata, or pretend decoded-code audio is a perfect replacement for the source. Decoded-code audio is fine as an internal proxy sanity check, not as holy scripture.

## Model pick

Do two runs, not one giant faith-based GPU ritual.

Qwen3 has official dense **0.6B** and **1.7B** variants, Qwen3-TTS uses those same size classes, and the **1.7B** base beats the **0.6B** base on their reported zero-shot WER. Inworld’s efficient production model is **1.6B**, and their open trainer example config defaults to **Llama-3.2-1B-Instruct**. My recommendation is:

* **Smoke/debug run:** Llama-3.2-1B-Instruct or a 0.6B/1B dense model
* **Real multilingual run:** dense **Qwen3-1.7B** class model

That gives you low-friction bring-up first, then the serious run where multilingual text understanding actually matters. Do **not** use MoE. Do **not** use a “thinking” model. Dense and boring wins here. ([GitHub][3]) 

## 1. Data mixture

Split by **language** and by **quality tier**. Do **not** throw everything into one mega-dataset and hope sampling karma saves you. In the open Inworld trainer, dataset weighting is implemented as dataset-level “epochs” over **sample counts**, not token counts, so naive weights will skew the actual token mix whenever average durations differ. Weight by **target token mass**, not raw sample counts. ([GitHub][4])
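To make that concrete, here is a minimal sketch (dict shapes and names are mine, not the trainer's) that turns desired token-mass fractions into dataset-level epoch multipliers:

```python
def token_mass_epochs(datasets, target_fracs):
    """Convert desired token-mass fractions into dataset-level 'epoch'
    multipliers of the kind applied over sample counts.

    datasets: {name: {"samples": int, "tokens": int}} where tokens is the
              dataset's total target-token count
    target_fracs: {name: float} desired fraction of the token budget
    """
    total_tokens = sum(d["tokens"] for d in datasets.values())
    epochs = {}
    for name, d in datasets.items():
        current_frac = d["tokens"] / total_tokens
        # epoch multiplier that moves this dataset from its natural
        # token share to the target share
        epochs[name] = target_fracs[name] / current_frac
    return epochs
```

A dataset with long average clips carries more tokens per sample, so it gets a smaller epoch multiplier than its sample count alone would suggest.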

My recommended **PT-S1** mix by **token budget**:

```text
82% raw PT audio codes
8% HQ paired bootstrap audio-text from SFT (about 15k to 20k hours)
10% text-only tokens
```

That shape is close to what worked in Inworld: mostly raw audio, a small HQ paired bootstrap, and about 10% text data. Their report says the HQ bootstrap was 15k hours and the text side used RedPajama-v2 plus LAION OIG. ([arXiv][2])

Inside the **audio pool**, do this:

```text
PT language target
English cap: 45%
Hindi: 15%
Remaining 10 Indic languages: 40% total, using sqrt(hours_lang) sampling with a 2% floor each
```

That is not fake equality. It is controlled unfairness so English does not eat your model alive.
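A single-pass sketch of that allocation, assuming you know per-language hours (the function name and the one-pass floor handling are mine; a fully general version would iterate until no language sits under the floor):

```python
import math

def language_weights(hours, floor=0.02, pool_frac=0.40):
    """Allocate a pool (40% of the audio budget) across languages using
    sqrt(hours) sampling with an absolute floor per language.

    hours: {lang: hours_of_audio}
    Returns fractions of the *total* audio budget.
    """
    w = {lang: math.sqrt(h) for lang, h in hours.items()}
    total = sum(w.values())
    fracs = {lang: pool_frac * v / total for lang, v in w.items()}
    # raise anything under the floor, then renormalize the remainder
    low = {lang for lang, f in fracs.items() if f < floor}
    if low:
        spare = pool_frac - floor * len(low)
        rest = sum(w[lang] for lang in w if lang not in low)
        fracs = {lang: (floor if lang in low else spare * w[lang] / rest)
                 for lang in w}
    return fracs
```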

For **SFT**, I would do this by token budget:

```text
65% cleaned native-only in-the-wild paired data
20% HifiTTS-2
10% custom curated paired data
5% IndicVoices-R anchor
```

Then enforce language caps on top:

```text
English <= 55% of total SFT token budget
Hindi ≈ 15%
Remaining 10 Indic languages ≈ 30% total, floor 2% each
```

Also, **upweight IndicVoices-R** beyond its raw hours. It is small but clean, which makes it a very good anchor set instead of a rounding error.

For **text-only PT**, the fastest decent sources are:

* **English:** RedPajama-v2 or FineWeb
* **Indic:** AI4Bharat **IndicCorpV2** or **Sangraha**

Inworld used RedPajama-v2 and LAION OIG in PT, and AI4Bharat exposes strong Indic text corpora you can grab without inventing a fresh web-crawl headache for yourself. ([arXiv][2])

Preprocess text once, then freeze it:

```text
language ID
dedupe / near-dedupe
Unicode + script normalization
punctuation / numeral normalization
strip URLs, HTML, code blocks, tables
drop docs < 128 chars
drop docs > 4k tokenizer tokens
tokenize with the base LM tokenizer
save as pretraining token memmaps
```
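The cheap per-doc filters from that list can be sketched like this (the regexes and the `tokenizer_len` callable are illustrative stand-ins; plug in your base LM tokenizer and proper script-aware normalization for Indic text):

```python
import re
import unicodedata

def keep_doc(text, tokenizer_len, min_chars=128, max_tokens=4096):
    """Apply the cheap document filters: normalize, strip URLs/HTML,
    drop too-short and too-long docs. tokenizer_len is a callable
    returning the token count under the base LM tokenizer."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)       # crude HTML strip
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) < min_chars:
        return None
    if tokenizer_len(text) > max_tokens:
        return None
    return text
```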

A small instruction slice is fine in **PT**. Do **not** mix text-only instruction data into **SFT run 1**. Inworld reports that mixed text instruction-following during SFT degraded synthesis quality. ([ar5iv][1])

## 2. Prompting

Use **one grammar**, not three unrelated religions.

Qwen3-TTS says all training data is standardized in **ChatML**, but the open Inworld trainer keeps TTS prompting much simpler: TTS SFT uses transcript plus optional voice description followed by `<|speech_start|> ... <|speech_end|>`, and the PT datasets keep text and audio PT streams separate. For a first run, simplicity wins.  ([GitHub][5])

Use this mental model:

```text
PT raw audio: audio-code chunk only
PT text: text-token chunk only
SFT TTS: [optional voice description] + native transcript + <|speech_start|> codes <|speech_end|>
```
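As a sketch of SFT row assembly under that grammar (the `<|s_N|>` naming for code tokens is my placeholder; the start/end tokens follow the layout above, and whatever names you pick must stay fixed from day 0):

```python
def build_sft_example(transcript, speech_codes, voice_desc=None):
    """Assemble one SFT TTS training string in the single-grammar format:
    [optional voice description] + transcript + speech span."""
    prefix = (voice_desc + "\n" if voice_desc else "") + transcript
    codes = "".join(f"<|s_{c}|>" for c in speech_codes)
    return prefix + "<|speech_start|>" + codes + "<|speech_end|>"
```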

For **run 1**:

* Use **Native** transcript only
* Skip code-mixed
* Skip romanized
* Skip your experimental audio-tagged format

Get clean native-script speech working first. Then add more chaos. Humanity produces enough chaos for free already.

Do **not** keep changing the Jinja/template format by dataset. Keep the same special tokens from day 0. Optional fields are fine. Multiple incompatible prompt grammars are how people end up with a model that sort of understands everything and cleanly speaks nothing.

## 3. Hyperparameters and stage plan

I’m anchoring this to the public Inworld trainer/report shape: **2048** context, **AdamW**, **bf16**, **cosine decay**, **0.1** weight decay, and roughly **0.5M tokens per optimizer step** for the 1B-class PT setup. Their report also says SFT worked best when initialized at the **final PT learning rate**. The example config uses `betas=(0.9, 0.95)` and `gradient_clip=1.0`. ([GitHub][6])

### Smoke run

Use this first, for 5k to 10k steps:

```text
model: 0.6B or 1B dense
max_seq_len: 2048
global_batch_tokens: 262,144
optimizer: AdamW
betas: (0.9, 0.95)
eps: 1e-8
weight_decay: 0.1
grad_clip: 1.0
precision: bf16
gradient_checkpointing: on
compile: on for PT only
peak_lr:
  0.6B -> 2e-4
  1B   -> 1.5e-4
min_lr: 1e-5
warmup_ratio: 0.05
schedule: cosine
```

### Main run, 1.7B

**PT-S1 general**

```text
max_seq_len: 2048
global_batch_tokens: 524,288
peak_lr: 1e-4
min_lr: 1e-5
warmup_ratio: 0.08
betas: (0.9, 0.95)
eps: 1e-8
weight_decay: 0.1
grad_clip: 1.0
precision: bf16
gradient_checkpointing: on
target_tokens: 200B to 250B
```

Useful microbatch examples:

```text
8 GPUs  -> microbatch 2, grad_accum 16
16 GPUs -> microbatch 2, grad_accum 8
32 GPUs -> microbatch 2, grad_accum 4
```
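Those rows all satisfy the same identity: `gpus * microbatch * grad_accum * seq_len = global_batch_tokens`. A helper to back out the accumulation steps:

```python
def grad_accum_for(global_tokens, gpus, microbatch, seq_len=2048):
    """Gradient accumulation steps so that
    gpus * microbatch * grad_accum * seq_len == global_tokens."""
    per_step = gpus * microbatch * seq_len
    assert global_tokens % per_step == 0, "pick a divisible combo"
    return global_tokens // per_step
```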

**PT-S2 high-quality continual PT**

```text
max_seq_len: 2048
global_batch_tokens: 524,288
peak_lr: 5e-5
min_lr: 5e-6
warmup_ratio: 0.03
target_tokens: 20B to 30B
data mix: 70% HQ paired, 20% clean raw audio, 10% text-only
```

**PT-S3 long-context**

Do **4096** first. Do **not** jump to 8192 on day one unless the rest is already stable.

```text
max_seq_len: 4096
global_batch_tokens: 262,144
peak_lr: 3e-5
min_lr: 3e-6
warmup_ratio: 0.02
target_tokens: 10B to 20B
data: genuinely long-form subset only
```

Qwen3-TTS explicitly uses a staged curriculum with a later long-context phase, rather than forcing long context from the start. 

**SFT**

Assuming xcodec2-like **50 Hz**, your **160k to 180k hours** of SFT is about **28.8B to 32.4B audio tokens**. That is plenty for a serious SFT pass. Inworld's constants file and codec setup also use 50 tokens/sec for their xcodec2-like single-codebook stack. ([GitHub][7])
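The arithmetic, if you want it checkable:

```python
def audio_tokens(hours, tokens_per_sec=50):
    """Hours of audio -> codec tokens at a 50 Hz single-codebook rate."""
    return hours * 3600 * tokens_per_sec
```

160k hours works out to 28.8B tokens, 180k hours to 32.4B.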

Use:

```text
max_seq_len: 2048
global_batch_tokens: 262,144 to 524,288
peak_lr: 1e-5
min_lr: 1e-6
warmup_ratio: 0.03
schedule: cosine
epochs: 1.0 to 1.5 effective passes
text-only SFT: off
sort/bucket by length: on
```

And one important point: **do not do LR bumps inside a stage**. The optimizer does not magically understand your curriculum. Use a scheduler **within** each stage, then restart that scheduler at a **lower peak LR** when you change data distribution or context length.
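A per-stage cosine-with-warmup sketch: each stage instantiates this with its own, lower peak, rather than bumping the LR mid-stage (names are mine; any standard scheduler with restarts-per-stage works the same way):

```python
import math

def stage_lr(step, total_steps, peak_lr, min_lr, warmup_ratio):
    """Cosine-with-warmup LR for ONE stage. Restart with a lower peak_lr
    when the data distribution or context length changes."""
    warmup = max(1, int(total_steps * warmup_ratio))
    if step < warmup:
        return peak_lr * (step + 1) / warmup  # linear warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```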

## 4. Loss functions

For **run 1**, keep it boring:

```text
PT:
  standard causal LM cross-entropy on next token
  for audio PT streams and text PT streams

SFT:
  same cross-entropy
  mask everything before <|speech_start|> to -100
```

That is it.

The repo implements exactly that masking behavior for TTS SFT, and a separate text fine-tuning path masks everything before the final assistant response. But the Inworld report says mixing text instruction-following data during SFT hurt speech quality, so for your first serious TTS run, **text belongs in PT, not SFT**. ([GitHub][8])
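A pure-Python sketch of that label masking over a list of token ids (a real collator does this on tensors; switch `cut` to `cut + 1` if your trainer also masks the start token itself):

```python
def mask_labels(token_ids, speech_start_id, ignore_index=-100):
    """SFT label construction: everything before <|speech_start|> gets
    ignore_index so only the speech span carries cross-entropy loss."""
    cut = token_ids.index(speech_start_id)  # ValueError if token missing
    return [ignore_index] * cut + list(token_ids[cut:])
```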

Also, since your codec is **single-codebook xcodec2-like**, do **not** copy Qwen’s multi-codebook MTP losses. Those are for their **12Hz multi-codebook tokenizer**, not your setup. One LM head, one CE objective. 

### About WER

WER is **not** a normal PT/SFT loss here. It is:

* an **eval metric**
* or a **reward** in a later RL stage

That’s exactly how Inworld uses it, together with speaker similarity and DNSMOS in GRPO-style post-training. Qwen3-TTS also uses post-training methods after PT, not WER-as-main-loss in PT. ([arXiv][2]) 

For **Indic ASR**, you are not totally stranded. AI4Bharat’s **IndicConformer** now covers **22 official Indian languages**, so you have an open baseline for offline eval. For Indic, I would prefer **CER or normalized token error** over naive whitespace word splitting, and I would **not** inject ASR-based rewards into training until you trust the per-language ASR enough to not poison the signal. ([AI4Bharat][9])

For later RL, my starting reward for languages with reliable ASR would be:

```text
0.4 content (WER/CER)
0.3 speaker similarity
0.3 DNSMOS / UTMOS proxy
```

For languages with weak ASR, set content reward to zero rather than feeding garbage into the reward model. Garbage rewards are how RL turns from “alignment” into “performance art.”
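The combination, as a sketch (inputs assumed already normalized to [0, 1]; whether to renormalize the remaining weights when content is zeroed is your call, and here it is deliberately not done):

```python
def reward(wer, spk_sim, mos_norm, asr_reliable=True):
    """Starting reward mix: 0.4 content, 0.3 speaker similarity,
    0.3 MOS proxy. Content uses 1 - WER/CER and is zeroed outright
    when the language's ASR is untrusted."""
    content = 0.4 * max(0.0, 1.0 - wer) if asr_reliable else 0.0
    return content + 0.3 * spk_sim + 0.3 * mos_norm
```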

## 5. What to track and checkpoint policy

Do **not** run blind. That is how people spend a week training a very expensive parrot with a throat injury.

Track three buckets:

### Optimization

* total loss
* audio-token loss
* text-token loss
* loss by dataset source
* loss by language
* grad norm
* LR
* tokens/sec
* audio-seconds/sec
* dataloader wait time
* padding ratio
* skipped steps / NaNs

### Generative sanity

* EOS / stop-token failure rate
* average generated length / target length
* consecutive token repeat rate
* n-gram repetition rate
* silence-only decode rate
* decode crash / invalid token rate

### External quality

Run this on the sidecar evaluator:

* WER/CER
* speaker similarity
* DNSMOS / UTMOS
* duration ratio
* per-language decode panels
* long-form failure rate
* cross-lingual cloning sanity

### Checkpoints

Use this policy:

```text
steps 0 to 10k: save every 1k
10k to 50k:     save every 5k
50k onward:     save every 10k
milestones:     keep every 50k or 100k permanently
```
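That cadence as a function, so the training loop cannot fat-finger it:

```python
def save_interval(step):
    """Checkpoint cadence: every 1k early, 5k mid, 10k late."""
    if step < 10_000:
        return 1_000
    if step < 50_000:
        return 5_000
    return 10_000

def should_save(step):
    return step > 0 and step % save_interval(step) == 0
```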

Storage policy:

```text
R2: full checkpoints + optimizer states
HF: milestone model-only checkpoints + config + tokenizer + eval notes
keep_last_full: 5
keep_last_model_only: 20
```

Do **not** push every full optimizer-state checkpoint to Hugging Face unless you enjoy wrestling LFS for sport.

And yes, keep a tiny frozen in-loop validation set. **1k samples is enough**. CE on codes/text is cheap. Heavy decode eval can stay on the separate machine.

## 6. Monitoring, tooling, and data handling

Use:

* **W&B** for losses, panels, audio tables
* **NVIDIA DCGM exporter + Prometheus + Grafana** for node/GPU telemetry
* **nvitop** for SSH sanity checks
* **PyTorch profiler** on the first 500 steps and one later steady-state window

For storage, **do not train by random-reading memmaps directly from R2**. Stage active shards onto local NVMe and prefetch asynchronously. Inworld’s whole storage trick is pretokenized codes plus indexes to cut I/O and storage; lean into that, but cache the hot shards locally. ([arXiv][2])

Before the real run, do this audit on **10k samples**:

```text
code ids within [0, codebook_size - 1]
no OOV speech tokens
duration matches code length
empty / punct-only transcripts
language-tag vs script mismatch
per-language duration histograms
char-per-second and char-per-code outliers
decode 100 random samples per language
```
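A sketch of the cheap per-sample checks (field names and the 10% duration tolerance are mine; adapt to your manifest schema):

```python
def audit_sample(s, codebook_size, tokens_per_sec=50, tol=0.10):
    """Run cheap manifest checks on one sample.
    s = {"codes": [int], "duration_s": float, "transcript": str}
    Returns a list of failed check names."""
    fails = []
    if any(not (0 <= c < codebook_size) for c in s["codes"]):
        fails.append("oov_code")
    expected = len(s["codes"]) / tokens_per_sec
    if abs(expected - s["duration_s"]) > tol * max(s["duration_s"], 1e-6):
        fails.append("duration_mismatch")
    if not s["transcript"].strip(" .,!?।"):  # empty or punct-only
        fails.append("empty_transcript")
    return fails
```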

And do a **32-sample overfit test**. If the model cannot brutally overfit 32 samples, do **not** light the cluster on fire yet.

## 7. Separate validation/test machine

That part is fine. The right version is:

* training job saves checkpoints
* evaluator machine polls every 5k steps
* evaluator runs fixed prompt packs
* metrics go back to W&B / your DB
* training never stalls

But I would still keep **one tiny in-loop CE validation set** inside the training job. Skipping even that is reckless. One bad prompt bug or one bad dataset path and you’ll get a beautiful loss curve for a model that speaks like a haunted fax machine.

## My blunt launch order

1. **Freeze your current PT codes as dataset-v1.** Stop caring whether you can regenerate them.
2. **Build a tiny gold eval pack now** from SFT data where original audio is preserved: 1k CE-val samples, 64 fixed decode prompts, multilingual and cross-lingual.
3. **Run the 32-sample overfit test.**
4. **Run the 0.6B/1B smoke job** for 5k to 10k steps.
5. **Launch 1.7B PT-S1** with the exact mixture above.
6. **Do HQ CPT**, then **SFT native-only**.
7. Add code-mixed / romanized / experimental markup data **only after** the base model is clean and stable.

One last thing for future-you: every new corpus version should save a manifest with source id, start/end ms, VAD config hash, codec checkpoint hash, sample rate, transcript hash, and normalization flags. Past-you already made one mess. No need to franchise it.

[1]: https://ar5iv.org/pdf/2507.21138 "https://ar5iv.org/pdf/2507.21138"
[2]: https://arxiv.org/html/2507.21138v1 "TTS-1 Technical Report"
[3]: https://github.com/QwenLM/Qwen3 "https://github.com/QwenLM/Qwen3"
[4]: https://raw.githubusercontent.com/inworld-ai/tts/main/tts/data/tts_datasets.py "raw.githubusercontent.com"
[5]: https://raw.githubusercontent.com/inworld-ai/tts/main/tts/core/prompting.py "raw.githubusercontent.com"
[6]: https://raw.githubusercontent.com/inworld-ai/tts/main/example/configs/sft.json "raw.githubusercontent.com"
[7]: https://raw.githubusercontent.com/inworld-ai/tts/main/tts/core/constants.py "raw.githubusercontent.com"
[8]: https://raw.githubusercontent.com/inworld-ai/tts/main/tts/data/datasets/finetuning.py "raw.githubusercontent.com"
[9]: https://ai4bharat.iitm.ac.in/areas/asr "https://ai4bharat.iitm.ac.in/areas/asr"
