This is **not** a simple “English forgot everything” story, and it’s also **not** just the benchmark lying. It looks like a real training-side tradeoff plus a small but catastrophic decode-loop failure mode.

## What `260k.json` says

The strongest signal is this:

- English is still being transcribed **in English**.
- It is **not** mostly blank outputs or script confusion.
- The huge English WER jump is dominated by a **tiny set of catastrophic looping samples**.

I checked `260k.json` directly:

- English samples: `500`
- English language-detection mismatches: `0`
- Empty English hypotheses: `0`
- English `WER-norm`: `56.31`
- English `WER-numcanon`: `55.92`

So number formatting is **not** the reason for the collapse. More importantly, I counted the failure mode:

- `26/500` English samples have repeated consecutive words
- `8/500` have catastrophic loops with run length `>= 4`
- If I exclude just those `8` looped samples, English `WER-norm` drops from **56.31 -> 23.03**

That means the “English collapsed by 29 points” story is mostly being driven by a handful of disastrous generations, not by broad degradation across all English samples.
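For transparency, the loop count above comes from a consecutive-token run-length check. A minimal sketch of that check (the helper names and the sample-dict shape are my own; only the `>= 4` threshold matches the cut used above):

```python
from itertools import groupby


def max_run_length(hypothesis: str) -> int:
    """Length of the longest run of identical consecutive whitespace tokens."""
    tokens = hypothesis.lower().split()
    if not tokens:
        return 0
    return max(len(list(group)) for _, group in groupby(tokens))


def split_by_looping(samples: list[dict], threshold: int = 4) -> tuple[list[dict], list[dict]]:
    """Partition benchmark samples (dicts with a "hypothesis" field, shaped
    like the 260k.json entries) into (clean, looped) by run length."""
    clean = [s for s in samples if max_run_length(s["hypothesis"]) < threshold]
    looped = [s for s in samples if max_run_length(s["hypothesis"]) >= threshold]
    return clean, looped
```

Recomputing the aggregate English WER over just `clean` is the kind of exclusion behind the 56.31 -> 23.03 comparison above.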
Here are real examples from `260k.json`:

```10643:10650:/workspace/maya-asr/260k.json
"id": "en_0064",
"language": "english",
"duration": 0.2343125,
"reference": "Up",
"hypothesis": "okay, okay, okay, okay, okay, okay, okay, okay, okay, okay, okay, okay, ...",
"detected_language": "English"
```

```10853:10860:/workspace/maya-asr/260k.json
"id": "en_0085",
"language": "english",
"duration": 0.281625,
"reference": "Two",
"hypothesis": "two two two two two two two two two two two two two two two two ...",
"detected_language": "English"
```

At the same time, many English samples still look normal or near-normal:

```10003:10010:/workspace/maya-asr/260k.json
"id": "en_0000",
"language": "english",
"reference": "what's the recipe for pasta sauce",
"hypothesis": "What's the recipe for past the song?",
"detected_language": "English"
```

So this is not “the model switched to Indic for English.” It is more like “a small subset of English clips, especially very short ones, are hitting a repetition / EOS failure mode.”

## Why this is happening

I think there are **two** things interacting.

### 1. The sampler really is starving English

I measured the effective training mix from the actual shard indexes under your current `temperature=5.0` logic.

Raw dataset share:

- English: `34.1%`

Effective shard-level temperature share:

- English: `11.8%`

That is a huge downweighting. By contrast, very small languages get boosted hard. So yes: the “English vs Indic seesaw” theory is **basically correct**. That comes directly from the current sampler logic:

```60:86:/workspace/training/dataset_v2.py
if config.temperature > 0 and split == "train":
    lang_counts = {}
    lang_shard_counts = {}
    ...
    total = sum(lang_counts.values())
    if total > 0 and lang_counts:
        lang_probs = {l: (c / total) ** (1.0 / config.temperature) for l, c in lang_counts.items()}
        prob_sum = sum(lang_probs.values())
        self._lang_probs = {l: p / prob_sum for l, p in lang_probs.items()}
        self._shard_weights = {
            s: self._lang_probs.get(self._shard_langs[s], 1.0 / len(self._lang_probs))
               / max(lang_shard_counts.get(self._shard_langs[s], 1), 1)
            for s in self.all_shards
        }
```

### 2. You are not just training the new vocab rows

The newly added embedding rows are only about `44.6M` params, which is about **2.1%** of the saved model. So this is **not** mainly “the new random blocks just need more time.” Your run log shows the real issue: essentially the whole model is trainable.

```17:21:/workspace/maya-asr/checkpoints_phase1_2026-04-07/train.log
2026-04-07 03:08:57 [INFO] Pretrained weights manually reloaded (bypassing _init_weights bug)
2026-04-07 03:08:57 [INFO] Parameters: 2.09B total, 2.09B trainable, 0 frozen groups
2026-04-07 03:09:18 [INFO] Dataset V2: pre-tokenized, exact model mels, 8 workers, batch_frames=400000, batch_utts=32
2026-04-07 03:09:19 [INFO] Optimizer: AdamW, base_lr=0.0002, LLRD=0.9, max_encoder_depth=47
```

So you are effectively doing a near-full-model finetune, while English is only getting ~12% of the sampling budget.

Also, at `260k` steps your base LR is still about **`1.18e-4`** on the cosine schedule. That is **not low**. If anything, for preserving English decoder behavior, it is still fairly aggressive.

## Is `benchmark_cohere.py` the problem?

Not really, but it **does amplify** this particular failure mode.
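As an aside before the benchmark details: the temperature reweighting measured in the sampler section is easy to reproduce standalone. This is a sketch; the toy counts below are invented, and only the `share ** (1 / T)` transform mirrors the `dataset_v2.py` logic:

```python
def effective_lang_probs(lang_counts: dict[str, int], temperature: float) -> dict[str, float]:
    """share ** (1 / T), renormalized -- the same transform as the sampler."""
    total = sum(lang_counts.values())
    raw = {lang: count / total for lang, count in lang_counts.items()}
    tempered = {lang: share ** (1.0 / temperature) for lang, share in raw.items()}
    norm = sum(tempered.values())
    return {lang: p / norm for lang, p in tempered.items()}


# Toy mix: one ~34% majority language plus nine equal minority languages.
# These counts are invented for illustration, not the real shard counts.
toy = {"english": 341, **{f"lang_{i}": 73 for i in range(9)}}
mix = effective_lang_probs(toy, temperature=5.0)
# With this toy mix, english drops from ~34% raw share to ~13% effective
# share; the real shard distribution gives the 34.1% -> 11.8% numbers above.
```

The key point the sketch makes concrete: at `T=5`, the exponent `1/T = 0.2` flattens the distribution hard, so the majority language always loses a large chunk of its share to the tail.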
The script is doing fixed-length generation with `max_new_tokens=256` for every sample, including ultra-short English clips:

```399:406:/workspace/maya-asr/benchmark_cohere.py
with torch.no_grad():
    try:
        outputs = model.generate(
            input_features=input_features,
            length=length,
            decoder_input_ids=decoder_input_ids,
            max_new_tokens=max_new_tokens,
        )
```

For a `0.23s` clip like `"Up"`, if the model misses EOS and starts looping, the benchmark gives it the full 256-token disaster, and WER explodes. So:

- the benchmark is **not inventing** the issue
- but the fixed `256` cap makes these short-clip loop failures look even worse in aggregate English WER

That means the 260k English regression is **real**, but its magnitude is being dominated by a small decode pathology rather than broad English incompetence.

## My take

- If your goal is **best Indic checkpoint**, `260k` looks very strong.
- If your goal is **balanced multilingual including English**, `260k` is **not** the checkpoint I would ship.
- I do **not** think the right interpretation is “the new vocab rows still need time.”
- I do think the right interpretation is:
  - English is under-sampled too aggressively
  - the full decoder is being moved too much
  - a few catastrophic loop failures are blowing up English WER

## What I’d have your training agent check next

1. Re-benchmark the English subset with a duration-aware decode cap, e.g. `max_new_tokens` tied to audio length, to see how much of the spike is EOS/looping versus true recognition drift.
2. Track loop-failure counts per checkpoint, not just WER. Right now that’s the hidden failure mode.
3. Compute and log the actual effective sampling mix each run. With `temperature=5.0`, English is only getting about `11.8%`.
4. Verify whether `0 frozen groups` is actually intended. That makes this run much more aggressive than “just adapting new tokens.”
5. If English matters, lower sampler temperature or add an English floor/quota for the next run.
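For item 1 on that list, the decode cap could be tied to clip duration with something like this (the tokens-per-second rate, floor, and ceiling are illustrative guesses, not tuned values):

```python
def duration_aware_max_new_tokens(duration_s: float,
                                  tokens_per_second: float = 10.0,
                                  floor: int = 16,
                                  ceiling: int = 256) -> int:
    """Scale the decode budget with audio length instead of a flat 256.

    The constants are assumptions for illustration; the point is simply
    that a 0.23 s clip should never be allowed a 256-token generation.
    """
    budget = int(duration_s * tokens_per_second) + floor
    return max(floor, min(budget, ceiling))
```

With this, the `0.23s` `"Up"` clip gets a budget of ~18 tokens instead of 256, so a missed EOS can no longer turn one short clip into a 256-token WER bomb, while long clips still get the full cap.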
My practical recommendation:

- Keep `200k` as the current **balanced** candidate.
- Treat `260k` as the current **Indic-best but English-unstable** candidate.
- Don’t assume continuing the same run will reliably “fix English later.” Under the current sampler, I would expect more of this seesaw.

If you want, I can next help you design a better continuation strategy:

- keep current run going for Indic-only,
- or define a recovery phase for English retention,
- or propose exact sampler / LR changes for the next run.
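On the “exact sampler changes” option, one concrete possibility is to keep the temperature logic but clamp English to a minimum share afterwards. A sketch (the function and the 20% floor in the example are mine, not a tested recommendation):

```python
def apply_lang_floor(lang_probs: dict[str, float], lang: str, floor: float) -> dict[str, float]:
    """Clamp one language's sampling probability to at least `floor`,
    rescaling the remaining languages proportionally so the mix still sums to 1."""
    p = lang_probs.get(lang, 0.0)
    if p >= floor:
        return dict(lang_probs)
    rest = 1.0 - p
    scale = (1.0 - floor) / rest if rest > 0 else 0.0
    out = {l: q * scale for l, q in lang_probs.items() if l != lang}
    out[lang] = floor
    return out
```

Applied after the temperature step in `dataset_v2.py`, this would leave the Indic boosting intact while guaranteeing English never drops below the chosen quota.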