## Observation

The `260k` English spike is **real in the artifact**, but it is **not a broad English collapse**. Two things stand out from `260k.json`:

1. The bad English is concentrated in a tiny set of ultra-short clips, not spread across the whole English set.
2. Those failures are **runaway repetition to `max_new_tokens`**, not prompt leakage or language confusion.

Examples:

```10643:10650:/workspace/maya-asr/260k.json
"id": "en_0064",
"language": "english",
"lang_code": "en",
"source": "Svarah",
"duration": 0.2343125,
"reference": "Up",
"hypothesis": "okay, okay, okay, okay, okay, okay, okay, okay, ...",
"detected_language": "English"
```

```13023:13030:/workspace/maya-asr/260k.json
"id": "en_0302",
"language": "english",
"lang_code": "en",
"source": "Svarah",
"duration": 0.297,
"reference": "Up",
"hypothesis": "I am am am am am am am am am am am am ...",
"detected_language": "English"
```

What I measured from the `260k` artifact:

- All `500/500` English outputs stay in English script.
- `0` English outputs contain prompt junk like `<|...|>` or byte junk like `<0x..>`.
- `37` English benchmark clips are `<0.5s`.
- Your training cleanup explicitly removed `<0.5s` utterances, so these clips are out-of-distribution.
- Just `8/500` English samples account for about **59% of all English errors**.
- English full WER is `56.31`, but on English clips `>=0.5s` it is about **`22.61`**.
- Overall full WER is `34.04`, but without all `<0.5s` clips it is about **`31.38`**.

So the headline is:

- **Indic gains look real.**
- **The scary English number is heavily exaggerated by short-clip degeneration.**
- There may still be some English retention pressure issue, but `56.31` is overstating it a lot.

## Benchmark Path

Your benchmark script is also not using the same guarded post-processing path as the model’s batched `transcribe` flow.
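Concretely, the guard in question is prompt-prefix stripping plus EOS truncation. A minimal standalone sketch of that trimming logic (the name `trim_generated` and its signature are illustrative, not the repo's actual API):

```python
def trim_generated(token_ids, prompt_ids, eos_token_id=None):
    """Guarded post-processing sketch: drop an echoed prompt prefix,
    then cut everything from the first EOS token onward.

    Illustrative helper -- not the actual code in modeling_cohere_asr.py.
    """
    token_ids = list(token_ids)
    prompt_ids = list(prompt_ids)
    # Strip the prompt only if the output actually starts with it.
    if token_ids[: len(prompt_ids)] == prompt_ids:
        token_ids = token_ids[len(prompt_ids):]
    # Truncate at the first EOS, if one was emitted.
    if eos_token_id is not None and eos_token_id in token_ids:
        token_ids = token_ids[: token_ids.index(eos_token_id)]
    return token_ids
```

Without this kind of guard, an evaluator can decode the echoed prompt and anything generated past EOS straight into the hypothesis string, which is exactly the mismatch below.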
Current benchmark path:

```399:410:/workspace/maya-asr/benchmark_cohere.py
with torch.no_grad():
    try:
        outputs = model.generate(
            input_features=input_features,
            length=length,
            decoder_input_ids=decoder_input_ids,
            max_new_tokens=max_new_tokens,
        )

        for i, meta in enumerate(batch_meta):
            hyp = decode_tokens(tokenizer, outputs[i]).strip()
            detected = detect_output_language(hyp)
```

Model’s safer batched path:

```1202:1270:/workspace/training/tokenizer_extension/extended_model/modeling_cohere_asr.py
inputs = processor(audio=batch_waves, text=prompts, sampling_rate=batch_srs[0], return_tensors="pt")
inputs = {k: v.to(self.device) for k, v in inputs.items()}
if "input_ids" in inputs and "decoder_input_ids" not in inputs:
    inputs["decoder_input_ids"] = inputs.pop("input_ids")
if "decoder_input_ids" in inputs and "decoder_attention_mask" not in inputs:
    ...
generated_ids = self.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    do_sample=False,
    num_beams=1,
    decoder_start_token_id=int(inputs["decoder_input_ids"][0, 0].item()),
    use_cache=True,
)
...
if starts_with_prompt:
    token_ids = token_ids[prompt_len:]
if eos_token_id is not None:
    try:
        token_ids = token_ids[: token_ids.index(eos_token_id)]
```

I would be careful here: this benchmark mismatch is **not** inventing the repeated English outputs from nothing. Those loops are in the actual decoded artifact. But it does mean I would not treat extreme edge-case numbers as fully trustworthy until the benchmark matches serving behavior more closely.

## Best Next Course

1. **Do not interpret `56.31` as “English broadly collapsed.”** The stronger interpretation is: `260k` has a **short-utterance degeneration problem**, and because English has many tiny command-like clips, corpus WER gets nuked.

2. **Patch `benchmark_cohere.py` before making training decisions from this table.** I’d make it:
   - follow the same prompt/EOS trimming logic as `transcribe`
   - report `full` and `>=0.5s` metrics separately
   - report a `degenerate_decode_count`
   - report how many samples hit `max_new_tokens`

3. **Treat this as two separate problems, not one.**
   - Problem A: short-clip decode instability on OOD micro-utterances
   - Problem B: likely weaker English retention because your sampler strongly flattens the language mix

4. **Do not restart the run because of this benchmark alone.** The Indic improvements are too consistent for that. I’d only call this a true English-collapse training problem if English `>=0.5s` also degrades after re-benchmarking with the patched evaluator.

5. **For the next training phase, tune retention, not the whole pipeline.**
   - lower sampling temperature from `5.0` to something like `2` or `3`
   - give English a floor in sampling
   - resume with a lower LR than `2e-4`

If you want, I can patch `benchmark_cohere.py` now so it reports the right slices and makes this `200k vs 260k` comparison trustworthy.
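The sliced reporting from point 2 can be prototyped directly against the artifact, before touching the benchmark script. A minimal sketch, assuming the artifact is a flat list of records with `duration`, `reference`, and `hypothesis` fields as in the examples above; `looks_degenerate` and its thresholds are my own illustrative heuristic, not anything already in the repo:

```python
import json

def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1]

def corpus_wer(samples):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errs = sum(word_errors(s["reference"], s["hypothesis"]) for s in samples)
    words = sum(len(s["reference"].split()) for s in samples)
    return 100.0 * errs / max(words, 1)

def looks_degenerate(hyp, max_repeat_ratio=0.5, min_words=8):
    """Flag runaway repetition: one word dominating a long hypothesis.
    Thresholds are illustrative guesses, not tuned values."""
    words = hyp.lower().split()
    if len(words) < min_words:
        return False
    top = max(words.count(w) for w in set(words))
    return top / len(words) >= max_repeat_ratio

def sliced_report(samples, min_dur=0.5):
    """Report full vs >=0.5s WER plus a degenerate-decode count."""
    ge = [s for s in samples if s["duration"] >= min_dur]
    return {
        "full_wer": corpus_wer(samples),
        "ge_0.5s_wer": corpus_wer(ge) if ge else None,
        "degenerate_decode_count": sum(
            looks_degenerate(s["hypothesis"]) for s in samples),
    }

# Hypothetical usage, if 260k.json is a flat list of such records:
# report = sliced_report(json.load(open("260k.json")))
```

If the full-slice number collapses toward the `>=0.5s` number once degenerate decodes are excluded, that confirms the short-clip story rather than a broad English regression.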