---
name: tts readiness plan
overview: Harden this repo for a serious LLaMA-first PT/SFT training run by upgrading the data pipeline, batching/mixing, evaluation, and checkpoint/cluster-readiness layers while deferring RLHF and backbone migration.
todos:
  - id: audit-llama-path
    content: Lock the LLaMA-first scope and list the code paths that currently assume LLaMA tokenization/model behavior.
    status: completed
  - id: build-pt-data-path
    content: Design the missing pretraining data builder and multilingual-safe manifest format for audio-code PT and text-only PT inputs.
    status: in_progress
  - id: upgrade-mixing-batching
    content: Plan token-aware dataset mixing plus SFT length bucketing/max-token batching to improve multilingual balance and throughput.
    status: pending
  - id: harden-trainer-core
    content: Plan trainer changes for explicit stage budgets, configurable scheduler floors, safer resume state, and better checkpoint promotion.
    status: pending
  - id: build-eval-arsenal
    content: Design a fixed multilingual eval canary pack and config-driven quality-validation/monitoring workflow for PT and SFT.
    status: pending
  - id: stage-runbooks
    content: Prepare the staged smoke-PT, main-PT, and base-SFT runbooks and defer RLHF/long-context/control-data phases until after baseline stability.
    status: pending
isProject: false
---

# TTS Training Readiness Plan

## Planning Decisions

- Stay `Llama-3.x` first. The current stack is already biased toward LLaMA assumptions in [tts/core/tokenization.py](tts/core/tokenization.py), [tts/core/modeling.py](tts/core/modeling.py), and [tts/data/datasets/finetuning.py](tts/data/datasets/finetuning.py), so this is the lowest-risk path to a real run.
- Scope this milestone to `PT + SFT + strong eval/monitoring + cluster readiness`. Do not spend the first pass on RLHF; the current RLHF path is useful reference code but not yet the most reliable place to invest.
- Keep the base recipe simple: causal CE for PT, masked CE for SFT, text-only data mixed into PT rather than into base SFT, native-script paired SFT first, and long-context only after the base pipeline is stable.
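The masked-CE convention for SFT can be sketched as below. `build_sft_labels` and the `prompt_len` argument are illustrative names, not the repo's actual API; the point is that loss is computed only over the target audio-token span, using the conventional `-100` ignore index for prompt/transcript positions.

```python
IGNORE_INDEX = -100  # standard ignore value for cross-entropy losses

def build_sft_labels(input_ids, prompt_len):
    """Copy the input ids as labels, masking the leading prompt/transcript
    span so CE is computed only on the target audio-code tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

For PT the same sequences would simply keep every position unmasked, which is the plain causal-CE case.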

## What The Repo Already Gives Us

- Working PT/SFT trainer entrypoint in [tts/training/main.py](tts/training/main.py) and loop in [tts/training/training_loop.py](tts/training/training_loop.py).
- Vectorization and shard merge utilities in [tools/data/data_vectorizer.py](tools/data/data_vectorizer.py) and [tools/data/data_merger.py](tools/data/data_merger.py).
- DDP/FSDP/DeepSpeed hooks plus FlashAttention2 and fused AdamW in [tts/training/environment.py](tts/training/environment.py) and [tts/core/optimization.py](tts/core/optimization.py).
- Basic decode-side quality hooks in [tts/inference/quality_validation.py](tts/inference/quality_validation.py) and W&B scalar logging in [tts/utils/custom_logging.py](tts/utils/custom_logging.py).

## Highest-Value Gaps To Close

- Pretraining data prep is incomplete: [tts/data/datasets/pretraining.py](tts/data/datasets/pretraining.py) expects `*_pretraining_codes.npy` and `*_pretraining_tokens.npy`, but the repo only ships vectorization/merge tools for SFT-style `codes/index/samples` datasets.
- Multilingual readiness is blocked early: [tts/data/data_utils.py](tts/data/data_utils.py) still applies `filter_non_english`, and [tts/data/text_normalization.py](tts/data/text_normalization.py) only covers a narrow language set.
- Dataset mixing is sample-count based, not token/audio-duration aware: [tts/data/tts_datasets.py](tts/data/tts_datasets.py) uses `floor(len(dataset) * weight)`, which will distort multilingual and duration-balanced mixtures.
- SFT efficiency is still basic: batches are padded dynamically but there is no max-token batching, bucketing, or sequence packing in [tts/data/tts_datasets.py](tts/data/tts_datasets.py).
- Eval/checkpointing are not cluster-hardened: [tts/training/evaluation.py](tts/training/evaluation.py) is loss-only and not globally reduced, while [tts/training/checkpointing.py](tts/training/checkpointing.py) and [tts/training/training_loop.py](tts/training/training_loop.py) do not yet track best checkpoints or full restart state.
- Quality validation is placeholder-driven: [tts/inference/quality_validation.py](tts/inference/quality_validation.py) still relies on hard-coded prompt WAV and codec checkpoint placeholders instead of config-managed multilingual canaries.
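To make the mixing distortion concrete, a token-aware mixer would allocate draws from a token budget rather than from sample counts. This is a minimal sketch under assumed names; `token_aware_sample_counts` and the per-dataset `avg_tokens`/`share` fields are hypothetical, not existing repo structures.

```python
def token_aware_sample_counts(datasets, token_budget):
    """datasets: {name: {"avg_tokens": mean tokens per sample,
                         "share": desired fraction of total token mass}}.
    Returns how many samples to draw per dataset so the drawn *token mass*,
    not the drawn sample count, matches the target shares."""
    counts = {}
    for name, d in datasets.items():
        target_tokens = d["share"] * token_budget
        counts[name] = max(1, round(target_tokens / d["avg_tokens"]))
    return counts
```

A 50/50 token split between a long-sequence and a short-sequence dataset then correctly draws twice as many short samples, which `floor(len(dataset) * weight)` cannot express.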

## Final Recommendation Synthesis

- Keep from the strongest agent recommendations:
  - Use a staged PT recipe with text-only data mixed into PT, not base SFT.
  - Keep one prompt grammar and one loss story; avoid multiple prompt dialects and avoid WER/RLHF scope creep in the first milestone.
  - Prioritize token-aware multilingual mixing, fixed decode canaries, and strong checkpoint hygiene before scaling up.
- Defer for now:
  - Immediate Qwen migration; the current code is not yet backbone-agnostic enough for that to be the fastest path.
  - Immediate RLHF readiness; it can remain a later phase after PT/SFT are stable.
  - Aggressive long-context or mixed-objective SFT; both should come after the baseline model is clean.

## Workstreams

### 1. Data Pipeline Hardening

- Remove the English-only preprocessing gate in [tts/data/data_utils.py](tts/data/data_utils.py) and make multilingual filtering explicit/config-driven via [tts/data/filtering.py](tts/data/filtering.py).
- Add a real PT builder that converts merged code datasets plus text corpora into the `train|val_pretraining_codes.npy` and `train|val_pretraining_tokens.npy` files expected by [tts/data/datasets/pretraining.py](tts/data/datasets/pretraining.py).
- Emit reproducibility manifests during vectorization/PT-build time with codec checkpoint hash, split seed, language/source totals, token totals, sample counts, and filtering stats.
- Plan local-NVMe staging and shard-level caching for large runs instead of assuming direct random reads from remote storage.
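A manifest emitter along these lines would cover the reproducibility fields above. The schema and function name are proposals, not an existing repo format; a real builder would attach this next to the emitted `.npy` shards.

```python
import hashlib
import json
import time

def write_pt_manifest(path, codec_ckpt_path, split_seed, per_language, filtering_stats):
    """Write a reproducibility manifest next to the built PT shards.
    per_language: {lang: {"samples": n, "tokens": t, "sources": [...]}}."""
    with open(codec_ckpt_path, "rb") as f:
        codec_hash = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "codec_checkpoint_sha256": codec_hash,
        "split_seed": split_seed,
        "languages": per_language,
        "sample_total": sum(v["samples"] for v in per_language.values()),
        "token_total": sum(v["tokens"] for v in per_language.values()),
        "filtering": filtering_stats,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest
```

Hashing the codec checkpoint itself (rather than recording its path) is what makes stale-codec mismatches detectable at training time.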

### 2. Mixture And Batching Efficiency

- Replace sample-count weighting in [tts/data/tts_datasets.py](tts/data/tts_datasets.py) with token/audio-second-aware mixing so PT and SFT proportions reflect actual training mass.
- Add language-balancing controls for multilingual PT/SFT mixtures.
- Add SFT length bucketing or max-token batching so throughput and padding stay sane at cluster scale.
- Preserve `torch.compile`, FlashAttention2, fused AdamW, and gradient checkpointing as the default performance toolkit for PT.
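One common shape for the bucketing piece, sketched under an assumed name (`max_token_batches` is not an existing repo function): sort samples by length, then greedily pack batches so the padded cost of each batch stays under a token budget.

```python
def max_token_batches(lengths, max_tokens_per_batch):
    """Greedy length-bucketed batching. Sorting keeps similar lengths
    together; the padded cost of a batch is batch_size * longest_member."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, cur = [], []
    for i in order:
        # i is the longest candidate so far, so adding it would cost
        # (len(cur) + 1) * lengths[i] padded tokens
        if cur and (len(cur) + 1) * lengths[i] > max_tokens_per_batch:
            batches.append(cur)
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    return batches
```

In a real loader the resulting batch list should be reshuffled each epoch so sorting does not introduce a short-to-long curriculum.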

### 3. Prompting And Training Semantics

- Keep the simple prompt surface in [tts/core/prompting.py](tts/core/prompting.py) for run 1 rather than switching to more complex template families.
- Keep native-script paired SFT as the base run. Romanized, code-mixed, and control-style annotations stay as later augmentation phases.
- Evaluate whether to add an optional prompt-conditioned SFT dataset path, because inference in [tts/inference/inferencing.py](tts/inference/inferencing.py) already conditions on prompt audio while the current SFT dataset in [tts/data/datasets/finetuning.py](tts/data/datasets/finetuning.py) trains plain transcript-to-audio-token generation.

### 4. Training Control And Checkpoint Hygiene

- Add explicit stage budget controls such as `max_steps` or token-budget targets instead of deriving the entire run from one effective pass in [tts/training/main.py](tts/training/main.py).
- Make scheduler floors configurable in [tts/core/optimization.py](tts/core/optimization.py) and improve AdamW param grouping for long runs.
- Extend checkpoint state in [tts/training/checkpointing.py](tts/training/checkpointing.py) and resume logic in [tts/training/training_loop.py](tts/training/training_loop.py) to persist sampler/RNG progress and avoid linear fast-forwarding.
- Add best-checkpoint promotion and true global validation aggregation in [tts/training/evaluation.py](tts/training/evaluation.py).
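The resume-state extension could look like the sketch below. Function and field names are illustrative; a real trainer would also capture torch, CUDA, and NumPy RNG states plus the sampler's exact position.

```python
import random

def training_state_dict(step, sampler_epoch, samples_consumed):
    """State to persist alongside model/optimizer shards so a resume can
    restore data order exactly instead of fast-forwarding the loader."""
    return {
        "step": step,
        "sampler_epoch": sampler_epoch,
        "samples_consumed": samples_consumed,
        "python_rng": random.getstate(),
        # real trainer: also torch.get_rng_state(), torch.cuda.get_rng_state_all(),
        # and numpy.random.get_state()
    }

def restore_training_state(state):
    random.setstate(state["python_rng"])
    return state["step"], state["sampler_epoch"], state["samples_consumed"]
```

With `samples_consumed` and the RNG state persisted, the sampler can be reconstructed at the right offset in O(1) rather than replaying every batch.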

### 5. Evaluation And Monitoring Arsenal

- Replace placeholder validation assets in [tts/inference/quality_validation.py](tts/inference/quality_validation.py) with config-driven multilingual canary sets and fixed decode prompt packs.
- Extend logging in [tts/utils/custom_logging.py](tts/utils/custom_logging.py) and [tts/training/evaluation.py](tts/training/evaluation.py) to include per-language and per-dataset reporting, pad ratio, throughput, repeat/stop failures, and decode success stats.
- Keep one tiny in-loop CE validation set, then add heavier decode eval outside the main training loop so model selection is not based on CE alone.
- Add run-gating checks: vectorizer dry run, 32-sample overfit mode, and a 1k-sample fixed canary eval before large-cluster launches.
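The 32-sample overfit gate can be as simple as the sketch below (the name and threshold are placeholders): if training loss on the tiny fixed subset does not collapse, something in the data or loss path is broken and the large launch should be blocked.

```python
def overfit_gate(losses, min_drop_ratio=0.9):
    """Pass only if loss on the tiny overfit subset fell by at least
    min_drop_ratio relative to its starting value."""
    if len(losses) < 2:
        return False
    return losses[-1] <= losses[0] * (1.0 - min_drop_ratio)
```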

### 6. Runbooks And Staged Configs

- Prepare separate configs/runbooks for: smoke PT, main PT, and base SFT using the current LLaMA path.
- Freeze the first serious recipe: PT on audio codes plus a small high-quality paired bootstrap subset plus text-only tokens, followed by native-script paired SFT only.
- Treat long-context, code-mixed/romanized augmentation, control tokens, and RLHF as follow-on phases that begin only after the baseline passes fixed canaries.

## Recommended Execution Order

1. Build the missing PT data builders and multilingual-safe manifests.
2. Fix dataset mixing and SFT batching efficiency.
3. Harden checkpointing, eval aggregation, and fixed decode canaries.
4. Add prompt-conditioned SFT only if first-run product goals require stronger cloning behavior.
5. Run a tiny smoke job and an overfit test before any serious cluster launch.

