This is **not** a simple “English forgot everything” story, and it’s also **not** just the benchmark lying. It looks like a real training-side tradeoff plus a small but catastrophic decode-loop failure mode.

## What `260k.json` says

The strongest signal is this:

- English is still being transcribed **in English**.
- It is **not** mostly blank outputs or script confusion.
- The huge English WER jump is dominated by a **tiny set of catastrophic looping samples**.

I checked `260k.json` directly:

- English samples: `500`
- English language-detection mismatches: `0`
- Empty English hypotheses: `0`
- English `WER-norm`: `56.31`
- English `WER-numcanon`: `55.92`

So number formatting is **not** the reason for the collapse. More importantly, I counted the failure mode:

- `26/500` English samples have repeated consecutive words
- `8/500` have catastrophic loops with run length `>= 4`
- If I exclude just those `8` looped samples, English `WER-norm` drops from **56.31 -> 23.03**

That means the “English collapsed by 29 points” story is mostly being driven by a handful of disastrous generations, not by broad degradation across all English samples.
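For transparency, the loop count above comes from a consecutive-token run-length check. A minimal sketch of that check (the helper names and the sample-dict shape are my own; only the `>= 4` threshold matches the cut used above):

```python
from itertools import groupby


def max_run_length(hypothesis: str) -> int:
    """Length of the longest run of identical consecutive whitespace tokens."""
    tokens = hypothesis.lower().split()
    if not tokens:
        return 0
    return max(len(list(group)) for _, group in groupby(tokens))


def split_by_looping(samples: list[dict], threshold: int = 4) -> tuple[list[dict], list[dict]]:
    """Partition benchmark samples (dicts with a "hypothesis" field, shaped
    like the 260k.json entries) into (clean, looped) by run length."""
    clean = [s for s in samples if max_run_length(s["hypothesis"]) < threshold]
    looped = [s for s in samples if max_run_length(s["hypothesis"]) >= threshold]
    return clean, looped
```

Recomputing the aggregate English WER over just `clean` is the kind of exclusion behind the 56.31 -> 23.03 comparison above.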
Here are real examples from `260k.json`:

```10643:10650:/workspace/maya-asr/260k.json
"id": "en_0064",
"language": "english",
"duration": 0.2343125,
"reference": "Up",
"hypothesis": "okay, okay, okay, okay, okay, okay, okay, okay, okay, okay, okay, okay, ...",
"detected_language": "English"
```

```10853:10860:/workspace/maya-asr/260k.json
"id": "en_0085",
"language": "english",
"duration": 0.281625,
"reference": "Two",
"hypothesis": "two two two two two two two two two two two two two two two two ...",
"detected_language": "English"
```

At the same time, many English samples still look normal or near-normal:

```10003:10010:/workspace/maya-asr/260k.json
"id": "en_0000",
"language": "english",
"reference": "what's the recipe for pasta sauce",
"hypothesis": "What's the recipe for past the song?",
"detected_language": "English"
```

So this is not “the model switched to Indic for English.” It is more like “a small subset of English clips, especially very short ones, are hitting a repetition / EOS failure mode.”

## Why this is happening

I think there are **two** things interacting.

### 1. The sampler really is starving English

I measured the effective training mix from the actual shard indexes under your current `temperature=5.0` logic.

Raw dataset share:

- English: `34.1%`

Effective shard-level temperature share:

- English: `11.8%`

That is a huge downweighting. By contrast, very small languages get boosted hard. So yes: the “English vs Indic seesaw” theory is **basically correct**. That comes directly from the current sampler logic:

```60:86:/workspace/training/dataset_v2.py
if config.temperature > 0 and split == "train":
    lang_counts = {}
    lang_shard_counts = {}
    ...
    total = sum(lang_counts.values())
    if total > 0 and lang_counts:
        lang_probs = {l: (c / total) ** (1.0 / config.temperature) for l, c in lang_counts.items()}
        prob_sum = sum(lang_probs.values())
        self._lang_probs = {l: p / prob_sum for l, p in lang_probs.items()}
        self._shard_weights = {
            s: self._lang_probs.get(self._shard_langs[s], 1.0 / len(self._lang_probs))
               / max(lang_shard_counts.get(self._shard_langs[s], 1), 1)
            for s in self.all_shards
        }
```

### 2. You are not just training the new vocab rows

The newly added embedding rows are only about `44.6M` params, which is about **2.1%** of the saved model. So this is **not** mainly “the new random blocks just need more time.” Your run log shows the real issue: essentially the whole model is trainable.

```17:21:/workspace/maya-asr/checkpoints_phase1_2026-04-07/train.log
2026-04-07 03:08:57 [INFO] Pretrained weights manually reloaded (bypassing _init_weights bug)
2026-04-07 03:08:57 [INFO] Parameters: 2.09B total, 2.09B trainable, 0 frozen groups
2026-04-07 03:09:18 [INFO] Dataset V2: pre-tokenized, exact model mels, 8 workers, batch_frames=400000, batch_utts=32
2026-04-07 03:09:19 [INFO] Optimizer: AdamW, base_lr=0.0002, LLRD=0.9, max_encoder_depth=47
```

So you are effectively doing a near-full-model finetune, while English is only getting ~12% of the sampling budget.

Also, at `260k` steps your base LR is still about **`1.18e-4`** on the cosine schedule. That is **not low**. If anything, for preserving English decoder behavior, it is still fairly aggressive.

## Is `benchmark_cohere.py` the problem?

Not really, but it **does amplify** this particular failure mode.
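As an aside before the benchmark details: the temperature reweighting measured in the sampler section is easy to reproduce standalone. This is a sketch; the toy counts below are invented, and only the `share ** (1 / T)` transform mirrors the `dataset_v2.py` logic:

```python
def effective_lang_probs(lang_counts: dict[str, int], temperature: float) -> dict[str, float]:
    """share ** (1 / T), renormalized -- the same transform as the sampler."""
    total = sum(lang_counts.values())
    raw = {lang: count / total for lang, count in lang_counts.items()}
    tempered = {lang: share ** (1.0 / temperature) for lang, share in raw.items()}
    norm = sum(tempered.values())
    return {lang: p / norm for lang, p in tempered.items()}


# Toy mix: one ~34% majority language plus nine equal minority languages.
# These counts are invented for illustration, not the real shard counts.
toy = {"english": 341, **{f"lang_{i}": 73 for i in range(9)}}
mix = effective_lang_probs(toy, temperature=5.0)
# With this toy mix, english drops from ~34% raw share to ~13% effective
# share; the real shard distribution gives the 34.1% -> 11.8% numbers above.
```

The key point the sketch makes concrete: at `T=5`, the exponent `1/T = 0.2` flattens the distribution hard, so the majority language always loses a large chunk of its share to the tail.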
The script is doing fixed-length generation with `max_new_tokens=256` for every sample, including ultra-short English clips:

```399:406:/workspace/maya-asr/benchmark_cohere.py
with torch.no_grad():
    try:
        outputs = model.generate(
            input_features=input_features,
            length=length,
            decoder_input_ids=decoder_input_ids,
            max_new_tokens=max_new_tokens,
        )
```

For a `0.23s` clip like `"Up"`, if the model misses EOS and starts looping, the benchmark gives it the full 256-token disaster, and WER explodes. So:

- the benchmark is **not inventing** the issue
- but the fixed `256` cap makes these short-clip loop failures look even worse in aggregate English WER

That means the 260k English regression is **real**, but its magnitude is being dominated by a small decode pathology rather than broad English incompetence.

## My take

- If your goal is **best Indic checkpoint**, `260k` looks very strong.
- If your goal is **balanced multilingual including English**, `260k` is **not** the checkpoint I would ship.
- I do **not** think the right interpretation is “the new vocab rows still need time.”
- I do think the right interpretation is:
  - English is under-sampled too aggressively
  - the full decoder is being moved too much
  - a few catastrophic loop failures are blowing up English WER

## What I’d have your training agent check next

1. Re-benchmark the English subset with a duration-aware decode cap, e.g. `max_new_tokens` tied to audio length, to see how much of the spike is EOS/looping versus true recognition drift.
2. Track loop-failure counts per checkpoint, not just WER. Right now that’s the hidden failure mode.
3. Compute and log the actual effective sampling mix each run. With `temperature=5.0`, English is only getting about `11.8%`.
4. Verify whether `0 frozen groups` is actually intended. That makes this run much more aggressive than “just adapting new tokens.”
5. If English matters, lower sampler temperature or add an English floor/quota for the next run.
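For item 1 on that list, the decode cap could be tied to clip duration with something like this (the tokens-per-second rate, floor, and ceiling are illustrative guesses, not tuned values):

```python
def duration_aware_max_new_tokens(duration_s: float,
                                  tokens_per_second: float = 10.0,
                                  floor: int = 16,
                                  ceiling: int = 256) -> int:
    """Scale the decode budget with audio length instead of a flat 256.

    The constants are assumptions for illustration; the point is simply
    that a 0.23 s clip should never be allowed a 256-token generation.
    """
    budget = int(duration_s * tokens_per_second) + floor
    return max(floor, min(budget, ceiling))
```

With this, the `0.23s` `"Up"` clip gets a budget of ~18 tokens instead of 256, so a missed EOS can no longer turn one short clip into a 256-token WER bomb, while long clips still get the full cap.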
My practical recommendation:

- Keep `200k` as the current **balanced** candidate.
- Treat `260k` as the current **Indic-best but English-unstable** candidate.
- Don’t assume continuing the same run will reliably “fix English later.” Under the current sampler, I would expect more of this seesaw.

If you want, I can next help you design a better continuation strategy:

- keep current run going for Indic-only,
- or define a recovery phase for English retention,
- or propose exact sampler / LR changes for the next run.
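On the “exact sampler changes” option, one concrete possibility is to keep the temperature logic but clamp English to a minimum share afterwards. A sketch (the function and the 20% floor in the example are mine, not a tested recommendation):

```python
def apply_lang_floor(lang_probs: dict[str, float], lang: str, floor: float) -> dict[str, float]:
    """Clamp one language's sampling probability to at least `floor`,
    rescaling the remaining languages proportionally so the mix still sums to 1."""
    p = lang_probs.get(lang, 0.0)
    if p >= floor:
        return dict(lang_probs)
    rest = 1.0 - p
    scale = (1.0 - floor) / rest if rest > 0 else 0.0
    out = {l: q * scale for l, q in lang_probs.items() if l != lang}
    out[lang] = floor
    return out
```

Applied after the temperature step in `dataset_v2.py`, this would leave the Indic boosting intact while guaranteeing English never drops below the chosen quota.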