● Checked. The English "regression" is a decoding artifact, not a model regression. Don't stop training.

  The smoking gun:

  ┌───────────────────────┬───────────────┐
  │ Metric                │ Value         │
  ├───────────────────────┼───────────────┤
  │ Corpus WER (pooled)   │ 56.31%        │
  ├───────────────────────┼───────────────┤
  │ Per-sample median WER │ 20.00%        │
  ├───────────────────────┼───────────────┤
  │ Per-sample mean WER   │ 395% (!)      │
  ├───────────────────────┼───────────────┤
  │ Samples >100% WER     │ 34/500 (6.8%) │
  └───────────────────────┴───────────────┘

  The per-sample mean is ~20× the median because a handful of samples have WER in the tens of thousands of percent:

  REF [1 word]: "Off"     HYP [256 tok]: two two two two two two two two two two ... (×256)
  REF [1 word]: "Up"      HYP [256 tok]: I am am am am am am am am am am ...
  REF [1 word]: "Play"    HYP [256 tok]: After After After After After After ...
  REF [1 word]: "Three"   HYP [128 tok]: three, three, three, three, three, three, ...

  This is the classic repetition-loop / no-EOS-emission pathology on ultra-short utterances (single-word smart-speaker commands like "Off", "Up", "Play", "House", "Two", "Three"). The decoder runs all the way to max_new_tokens=256, and a 256-token hypothesis against a 1-word reference scores as roughly 1 substitution plus 255 insertions over 1 reference word, inflating that sample's WER by ~25,000 percentage points. Six such samples out of 500 are enough to drag corpus WER from ~20% (better than the 200k checkpoint!) to 56%.

  Why only English regressed: the indic-asr-benchmark-6k English subset contains these short command-style clips, while the Indic subsets are all full sentences, so the same decoding bug only manifests on the English split. It's not a data-mix issue and not a sampler problem.

  Real numbers with the 6 degenerate samples dropped:
  - Median WER 20% (vs 27.52% at 200k → English is actually still improving)
  - All 11 Indic languages at new bests
  - MER 11.34% (lowest yet)

  The model is healthier than ever; the benchmark just has a handful of adversarial inputs that trip greedy decoding.

  Fixes (pick one, all cheap):
  1. At benchmark_cohere.py:407, pass no_repeat_ngram_size=3 (or 4) to model.generate().
     This single argument kills the loop, and the "regression" will vanish from the next eval.
  2. Alternatively, add repetition_penalty=1.2.
  3. Or filter out clips shorter than 1.0 s before scoring (but #1 is the right fix; the model should handle them).
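For reference, fix #1 works because transformers' no_repeat_ngram_size bans any token that would complete an n-gram already present in the generated output. A toy pure-Python greedy decoder showing the mechanism (the token IDs, score table, and stuck "model" are invented for illustration; this is not the transformers implementation):

```python
EOS = 0
SCORES = {7: 1.0, 8: 0.5, 9: 0.2, EOS: 0.1}  # token 7 dominates -> repetition loop

def banned_next_tokens(out: list[int], n: int) -> set[int]:
    """Tokens that would complete an n-gram already present in `out`."""
    if n < 2 or len(out) < n:
        return set()
    prefix = tuple(out[-(n - 1):])           # the last n-1 generated tokens
    return {out[i + n - 1]                   # token that followed this prefix before
            for i in range(len(out) - n + 1)
            if tuple(out[i:i + n - 1]) == prefix}

def greedy(max_new_tokens: int, no_repeat_ngram_size: int = 0) -> list[int]:
    out: list[int] = []
    for _ in range(max_new_tokens):
        banned = banned_next_tokens(out, no_repeat_ngram_size)
        allowed = {t: s for t, s in SCORES.items() if t not in banned}
        tok = max(allowed, key=allowed.get)  # greedy pick among unbanned tokens
        if tok == EOS:
            break
        out.append(tok)
    return out

print(greedy(8))                             # → [7, 7, 7, 7, 7, 7, 7, 7]  (runaway loop)
print(greedy(8, no_repeat_ngram_size=2))     # → [7, 7, 8, 7, 9, 7]  (loop broken, EOS reached)
```

Note that in the blocked run the decoder not only breaks the loop but actually terminates early: once every looping continuation is banned, EOS becomes the best remaining option.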
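The mean/median/pooled gap in the table above is also easy to reproduce. A scaled-down toy, 10 samples instead of 500, using a plain-Python word-level Levenshtein rather than the benchmark's actual scorer:

```python
from statistics import mean, median

def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Word-level Levenshtein distance and reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution / match
        prev = cur
    return prev[-1], len(r)

# 9 ordinary samples at 20% WER plus one degenerate repetition loop.
pairs = [("hello world how are you", "hello world how is you")] * 9
pairs.append(("Off", " ".join(["two"] * 256)))          # 1-word ref vs 256-token loop

stats = [word_errors(r, h) for r, h in pairs]
per_sample = [d / n for d, n in stats]
pooled = sum(d for d, _ in stats) / sum(n for _, n in stats)

print(f"median {median(per_sample):.0%}  mean {mean(per_sample):.0%}  pooled {pooled:.0%}")
# → median 20%  mean 2578%  pooled 576%
```

One runaway sample out of ten is enough to blow up both the per-sample mean and the pooled corpus WER while the median stays put, which is exactly the signature in the table.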