Bharath, this is a meaty set of questions. Let me go through everything systematically. I've read through all three papers carefully alongside your setup.

## The Panic — You're Not Fucked

Deep breath. Let me explain why.

**Pretraining loss is just next-token prediction cross-entropy on the codec tokens.** That's it. All three papers confirm this: Inworld explicitly describes it as standard unsupervised language modeling via next-token prediction (Section 3.3), CosyVoice 3 does the same for its LM stage, and Qwen3-TTS's general stage (S1) is just building a monotonic text-to-speech mapping with cross-entropy. No WER, no MOS, no DNSMOS, no speaker similarity is computed during pretraining or SFT. Those are either evaluation metrics or RL reward signals used in a *post-training* stage that comes much later.

So your pre-computed codes are exactly what you need for pretraining. You're fine.

**WER as a "loss function"** — it's not. WER is never directly optimized during pretraining or SFT in any of these systems. It only appears in two places: (a) as an evaluation metric to benchmark checkpoints, and (b) as a reward signal during RL post-training (GRPO in Inworld, DiffRO in CosyVoice 3, DPO/GSPO in Qwen3-TTS). That RL stage is completely optional and comes after you already have a working model from pretraining + SFT. You can add it later when you have a functioning system.

**For the Indic ASR gap** — when you eventually get to RL post-training or even just evaluation, you can use IndicWhisper or AI4Bharat models as approximate WER estimators. They're not perfect but they're usable for relative comparisons and reward shaping. But again, this is a problem for months from now, not today.

**If you ever need original audio** (for RL rewards, evaluation, MOS scoring), you decode your XCodec2 tokens back to waveforms. Codec → audio is deterministic given the decoder weights. The reconstructed audio won't be bit-identical to the original YouTube segments, but it'll be perceptually close enough for computing DNSMOS, speaker similarity, and even WER. That's actually what CosyVoice 3's DiffRO does — it samples tokens via Gumbel-Softmax and then runs downstream models on the token representations, not on original audio. So even for RL, you're covered.

---

Now let me address each of your questions.

## 1. Data Mixture for Pretraining

Your plan of PT-data + 10% SFT is solid and directly mirrors what Inworld did — they incorporated ~15k hours of high-quality SFT data into the pretraining mixture to bootstrap text-to-speech alignment. With your scale (~3M videos for PT + 10% of ~170k hours SFT), the ratio is reasonable.

**On adding text data** — yes, you should. Inworld found this critical. They added ~10% text tokens from RedPajama-v2 (filtered for quality) plus instruction data from LAION OIG. The reasoning: audio-heavy pretraining otherwise erodes the base LLM's text comprehension, and preserving that comprehension directly helps during SFT, when the model needs to understand text prompts properly.

For your quickest, easiest sources of text data given your Indic focus:

The fastest path is to grab **Sangraha** (AI4Bharat's cleaned Indic text corpus) which covers all your 11 languages, plus **CulturaX** for English. Both are on HuggingFace, ready to download. For instruction-style data, use **Aya Dataset** from Cohere — it has multilingual instruction-following data including Hindi and several Indic languages. You don't need to do heavy processing on these. Just tokenize them with your LLM's text tokenizer, wrap them in the same special token format, and mix them in.

Target roughly 10% text tokens in your pretraining mixture. So your final mix: **~80% PT audio codes + ~10% SFT audio codes + ~10% text data** (split between clean web text and instruction data). The Inworld paper used ~20B text tokens alongside their 1M-hour audio corpus; scale proportionally for your setup.
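A minimal sketch of this kind of mixing: sample a source per training sequence with probability proportional to the target weights. The source names and the 80/10/10 split here are illustrative placeholders, not values from the papers.

```python
import random
from itertools import islice

# Illustrative mixture weights; use whatever ratio you settle on.
MIXTURE = {"pt_audio": 0.80, "sft_audio": 0.10, "text": 0.10}

def mixed_stream(sources, weights, seed=0):
    """Yield (source_name, sample) pairs, picking a source per step
    with probability proportional to `weights`."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(sources[name])
        except StopIteration:
            return  # simplest policy: stop when any source runs dry

# Stand-in iterators so the sketch is self-contained:
stream = mixed_stream({n: iter(range(10_000)) for n in MIXTURE}, MIXTURE)
counts = {n: 0 for n in MIXTURE}
for name, _ in islice(stream, 5_000):
    counts[name] += 1
# counts roughly follows the 80/10/10 split
```

In practice you'd do this at the shard or dataset level inside your streaming loader rather than per sample, but the weighting logic is the same.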

One thing to watch: make sure your text data is **long-form and high quality**. Filter out short snippets, boilerplate, and low-quality web scrapes. Inworld specifically filtered RedPajama for long-form documents.

## 2. Prompting / Input Formatting

Yes, the prompt format changes based on the data type, and this is important to get right before you start.

**For pretraining on raw audio codes** (your 3M PT data): The format is minimal — just wrap the audio tokens between special tokens. Inworld uses:
```
<|speech_start|> y1 y2 ... yS <|speech_end|>
```
No text prompt, no speaker info. The model just learns to predict continuous speech patterns. This is unsupervised next-token prediction on audio.

**For pretraining on SFT-subset data** (the 10% you're mixing in): Use the full TTS format so the model starts learning text-to-speech alignment early:
```
<|begin_of_text|> x1 x2 ... xT <|speech_start|> y1 y2 ... yS <|speech_end|>
```
where x tokens are the text transcription and y tokens are the audio codes.

**For pretraining on text data**: Standard text format, same as what the base LLM was trained on. Just text tokens with the LLM's native BOS/EOS tokens.

**For SFT**: Same TTS format as above, but now all data has paired text+audio. The Inworld format is exactly what you described.

You generally don't need to modify the Jinja template beyond ensuring these three formats are handled correctly. The key thing is: **be consistent with your special tokens from day one**. Define `<|speech_start|>`, `<|speech_end|>`, and any other special tokens you need before pretraining starts, add them to the vocabulary, and never change them. Inworld added 65536 audio tokens + 29 special tokens to the vocabulary and initialized the new embeddings by sampling from the mean/covariance of the original embedding matrix.
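To make the three formats concrete, here's a sketch using Inworld's special-token names. The helper functions are mine, and real code would operate on token IDs after vocabulary expansion, not on strings.

```python
# Special tokens from the Inworld paper; helpers are illustrative.
SPEECH_START, SPEECH_END = "<|speech_start|>", "<|speech_end|>"
BOS = "<|begin_of_text|>"

def format_pt_audio(audio_tokens):
    """Raw-audio pretraining: no text, no speaker info."""
    return [SPEECH_START, *audio_tokens, SPEECH_END]

def format_tts(text_tokens, audio_tokens):
    """Paired text+audio: the SFT subset during PT, and all of SFT."""
    return [BOS, *text_tokens, SPEECH_START, *audio_tokens, SPEECH_END]

def format_text(text_tokens):
    """Plain text in the base LLM's native format."""
    return [BOS, *text_tokens]

format_tts(["hello"], ["<a1>", "<a2>"])
# → ['<|begin_of_text|>', 'hello', '<|speech_start|>', '<a1>', '<a2>', '<|speech_end|>']
```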

## 3. Hyperparameters — Concrete Numbers

Based on what worked across all three papers, here's what I'd recommend for your setup. I'm assuming you're starting with Llama 3.2 1B as the primary candidate (closest to Inworld TTS-1's 1B backbone):

**Pretraining:**

| Param | Value |
|---|---|
| Optimizer | AdamW, β1=0.9, β2=0.95 |
| Peak LR | 1.5e-4 |
| LR Schedule | Cosine decay with 10% warmup |
| Weight decay | 0.1 |
| Max sequence length | 2048 tokens |
| Batch size | As large as your GPUs allow (Inworld used 8 per GPU on H100s with DDP) |
| Precision | bf16 mixed precision |
| Gradient clipping | 1.0 |

**No stage-wise LR bumping needed for pretraining.** A single cosine schedule with warmup handles it. The optimizer (AdamW with the above betas) adapts naturally. Inworld ran a single continuous pretraining run — their loss curve in Figure 3 shows smooth convergence with no restarts.
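As a plain-math sketch of that schedule (linear warmup, then cosine decay; plug it into `torch.optim.lr_scheduler.LambdaLR` by dividing out the peak LR — the `min_lr` floor is my assumption, not a value from the papers):

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then cosine decay down to min_lr."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Peaks at 1.5e-4 right at the end of the 10% warmup, then decays.
peak = lr_at(10_000, total_steps=100_000, peak_lr=1.5e-4)
```

Setting `min_lr` to ~10% of the peak is a common choice and also gives you a natural SFT starting point if you follow Inworld's advice of continuing from the final pretraining LR.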

**For SFT:**

| Param | Value |
|---|---|
| Peak LR | 1.5e-5 (10x lower than PT — this is critical) |
| LR Schedule | Cosine decay with 5% warmup |
| Starting checkpoint | Your best pretraining checkpoint |
| Everything else | Same as pretraining |

The key insight from Inworld: **initialize SFT learning rate to the final LR from pretraining**. They found this was crucial for optimal speech quality. So if your cosine schedule decays to ~1.5e-5 at the end of pretraining, start SFT at 1.5e-5 peak.

**If you go with Qwen3 0.6B or 1.7B instead**, the hyperparameters stay roughly the same. The main thing that changes is batch size and distributed strategy (DDP for ≤1B, FSDP or DeepSpeed Stage 2 for 1.7B+).

**Vocabulary expansion**: Add your 65536 XCodec2 tokens + special tokens to the base model vocabulary. Pad the total vocab size to a multiple of 32 for tensor core efficiency. Initialize new embeddings using the mean/covariance sampling method from Inworld (or simpler: just use the mean of existing embeddings with small random noise).
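A toy-scale sketch of the mean/covariance initialization (shapes are shrunk for illustration; at real sizes the covariance is hidden×hidden and still cheap to compute):

```python
import numpy as np

# Toy stand-in for the existing embedding matrix (real: ~128k x ~2048).
rng = np.random.default_rng(0)
old_emb = rng.normal(size=(1000, 16))          # (vocab, hidden)

# Fit a multivariate normal to the existing embeddings and sample
# the new-token embeddings from it.
mean = old_emb.mean(axis=0)
cov = np.cov(old_emb, rowvar=False)            # (hidden, hidden)
new_emb = rng.multivariate_normal(mean, cov, size=64)  # 64 new tokens

emb = np.vstack([old_emb, new_emb])            # expanded embedding matrix
```

The simpler fallback mentioned above is `mean + small Gaussian noise` per new row, which skips the covariance estimate entirely.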

## 4. Loss Functions

**Pretraining**: Standard causal LM cross-entropy (next-token prediction). That's literally it. All three papers use this and nothing else for the pretraining stage.

**SFT**: Same cross-entropy loss, but with masking. Your understanding is correct:
- For TTS samples: everything before `<|speech_start|>` is masked (set to -100). The model only learns to predict audio tokens given the text prompt.
- Loss is only computed on the audio token portion.
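Concretely, label construction for a TTS sample might look like this (a sketch; whether `<|speech_start|>` itself is also masked is an implementation choice that varies between codebases):

```python
IGNORE_INDEX = -100  # the default ignore_index of PyTorch cross-entropy

def tts_labels(input_ids, speech_start_id):
    """Copy input_ids to labels, masking everything before
    <|speech_start|> so loss is computed only on the speech span."""
    start = input_ids.index(speech_start_id)
    return [IGNORE_INDEX] * start + input_ids[start:]

# Hypothetical IDs: text prompt, then <|speech_start|>=50, audio, <|speech_end|>=51
ids = [101, 5, 6, 50, 900, 901, 51]
labels = tts_labels(ids, speech_start_id=50)
# labels == [-100, -100, -100, 50, 900, 901, 51]
```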

**Why Inworld's code supports text SFT** — mixed-objective fine-tuning is a real tension, not a mistake. They found that if SFT only feeds audio tasks, the model's text understanding degrades, but also that mixing text instruction data into SFT caused synthesis quality degradation (they explicitly mention this in Section 3.4 under "Challenges with mixed-objective fine-tuning"). So they compromised: text data goes into pretraining only, and SFT is audio-only. The text SFT masking code in the repo is likely left over from those mixed-objective experiments. For your run, **keep SFT audio-only**; it's what they found works best.

**You do NOT need WER, MOS, or any audio-level metric as a training loss.** These are evaluation metrics and RL rewards. Your pretraining and SFT stages are purely cross-entropy on tokens.

## 5. Checkpointing and Metrics to Track

**Metrics to log every step:**
- Training loss (cross-entropy)
- Learning rate
- Gradient norm
- Tokens per second throughput
- GPU memory utilization

**Metrics to log every N steps (say every 500-1000 steps):**
- Perplexity on a held-out mini validation set, even just 1,000 samples you set aside. You said no validation during training, but logging perplexity on a tiny fixed set costs almost nothing and gives you an early warning if something goes wrong.

**Checkpointing frequency**: With 80M segments, your step count depends on batch size. A rough calculation: if your effective batch size is ~256 sequences, that's ~312k steps for one epoch. Checkpointing every 10k steps gives you ~31 checkpoints per epoch, which is reasonable. For the first 20-30k steps, checkpoint more frequently (every 5k) since early training is where things are most likely to go sideways.

Don't checkpoint every 10k forever though — that's a lot of storage at 1B+ params. Keep the last 5 checkpoints rolling, plus manually pin any that look promising from your external eval runs.

**What to watch for in the loss curve**: The Inworld pretraining loss (Figure 3) drops sharply in the first ~25B tokens then gradually flattens. If your loss plateaus too early or spikes, something is wrong with your data pipeline or hyperparameters.

## 6. Monitoring and Tooling

**For experiment tracking**: Weights & Biases (wandb) is the path of least resistance. It logs everything I mentioned above with minimal code. If you want something self-hosted, MLflow works too.

**For data loading**: Since your codes are in R2, you'll want a streaming data loader that pulls from R2 on-the-fly rather than downloading everything to local disk. Use `webdataset` or `mosaic-streaming` (MosaicML's StreamingDataset) — both handle remote storage well and support efficient shuffling of sharded data. Given you've already worked with WebDataset format before, stick with that.

**One thing you didn't mention**: Data shuffling. For pretraining, ensure your data is well-shuffled across languages and domains. If your data loader serves all English first then all Hindi, the model will catastrophically forget earlier languages. Shard your data so each shard contains a mix of languages, or use a multi-source sampler that interleaves from different language buckets.
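One simple way to build mixed shards from per-language sample lists (a sketch, not Inworld's pipeline; in practice you'd apply this at the WebDataset shard-writing stage):

```python
import random

def mixed_shards(buckets, shard_size, seed=0):
    """Flatten per-language sample lists into globally shuffled shards
    so that no shard ends up monolingual."""
    rng = random.Random(seed)
    pool = [(lang, s) for lang, samples in buckets.items() for s in samples]
    rng.shuffle(pool)
    return [pool[i:i + shard_size] for i in range(0, len(pool), shard_size)]

# Toy buckets: three languages, 100 samples each.
buckets = {"en": list(range(100)), "hi": list(range(100)), "ta": list(range(100))}
shards = mixed_shards(buckets, shard_size=30)
# every shard now mixes languages instead of serving one at a time
```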

**torch.compile()**: Use it for pretraining where sequence length is fixed (2048). Inworld got significant throughput gains from this. For SFT with variable-length sequences it's less useful out of the box: changing shapes trigger recompilation, so either pad or bucket to fixed lengths there, or skip it.

**FlashAttention-2**: Use it. Non-negotiable for training efficiency at this scale.

## 7. What You Might Have Missed

A few things that aren't in your list but matter:

**Vocab initialization matters more than you think.** When you add 65k+ audio tokens to a pretrained LLM, how you initialize those embeddings affects early training stability. The Inworld approach (sample from the multivariate normal of existing embeddings) is simple and effective. Don't just use random initialization — it'll cause loss spikes early on.

**Sequence packing**: If your audio segments vary in length (they will), pack multiple shorter sequences into a single 2048-token training sequence to avoid wasting compute on padding. This can improve throughput by 30-50%.
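A greedy first-fit packer is enough to capture the idea (a sketch; real loaders often pack offline or inside the collate function, and you'd also reset attention masks and position IDs at sequence boundaries so packed segments don't attend to each other):

```python
def pack_sequences(lengths, max_len=2048):
    """Greedy first-fit packing: group variable-length segments into
    bins of at most max_len tokens to cut padding waste."""
    bins = []  # each bin: [total_tokens, [segment indices]]
    for i, n in enumerate(lengths):
        n = min(n, max_len)                 # truncate overlong segments
        for b in bins:
            if b[0] + n <= max_len:
                b[0] += n
                b[1].append(i)
                break
        else:
            bins.append([n, [i]])
    return [idxs for _, idxs in bins]

pack_sequences([1500, 400, 600, 100])   # → [[0, 1, 3], [2]]
```

First-fit is O(n²) but fine at shard granularity; fancier bin-packing buys little extra throughput.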

**Audio token rate**: Your XCodec2 runs at 50 tokens/sec (same as Inworld's). A 2048-token context window gives you ~40 seconds of audio max, and less for SFT samples once the text prompt tokens are counted. Make sure your VAD segments fit within this. If you have segments longer than 40s, truncate or split them.

**Language sampling strategy**: Don't just use natural proportions from your data. English will dominate everything otherwise. CosyVoice 3 and Qwen3-TTS both use strategic upsampling of minority languages. For your Indic languages, consider temperature-based sampling (common approach: sample language proportionally to sqrt(proportion) or proportion^0.7 to boost low-resource languages without overwhelming the model).
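The temperature trick in code (a sketch; `alpha` and the hour counts below are illustrative):

```python
def sampling_probs(hours_by_lang, alpha=0.7):
    """Temperature-based language sampling: p_i ∝ share_i ** alpha.
    alpha=1.0 is natural proportions; alpha=0.5 is the sqrt variant."""
    total = sum(hours_by_lang.values())
    scaled = {l: (h / total) ** alpha for l, h in hours_by_lang.items()}
    z = sum(scaled.values())
    return {l: s / z for l, s in scaled.items()}

# Hypothetical hour counts: alpha=0.7 boosts Hindi from 8% to ~15%.
probs = sampling_probs({"en": 90_000, "hi": 8_000, "ta": 2_000})
```

Tune `alpha` by checking that your lowest-resource language still gets enough steps per epoch without English dropping below roughly half the mix.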

---

To summarize the immediate action items before you start:

1. Your pre-computed codes are fine. No blunder.
2. Grab Sangraha + CulturaX + Aya for text data, tokenize, and add to your pretraining mix at ~10%.
3. Set up the three input formats (raw audio, text+audio, text-only) with consistent special tokens.
4. Use the hyperparameters above, start pretraining with cosine LR schedule, bf16, FlashAttention-2.
5. Log loss + gradient norm + throughput. Checkpoint every 5-10k steps.
6. Ensure data shuffling across languages.

Go start the run. The rest (RL, evaluation pipeline, Indic WER workarounds) can wait until you have a pretrained model.