● Checked. The English "regression" is a decoding artifact, not a model regression. Don't stop training.

  The smoking gun:

  ┌───────────────────────┬───────────────┐
  │ Metric                │ Value         │
  ├───────────────────────┼───────────────┤
  │ Corpus WER (pooled)   │ 56.31%        │
  ├───────────────────────┼───────────────┤
  │ Per-sample median WER │ 20.00%        │
  ├───────────────────────┼───────────────┤
  │ Per-sample mean WER   │ 395% (!)      │
  ├───────────────────────┼───────────────┤
  │ Samples >100% WER     │ 34/500 (6.8%) │
  └───────────────────────┴───────────────┘

  The per-sample mean is ~20× the median because a handful of samples have WER in the tens of thousands of percent:

  REF [1 word]: "Off"     HYP [256 tok]: two two two two two two two two two two ... (×256)
  REF [1 word]: "Up"      HYP [256 tok]: I am am am am am am am am am am ...
  REF [1 word]: "Play"    HYP [256 tok]: After After After After After After ...
  REF [1 word]: "Three"   HYP [128 tok]: three, three, three, three, three, three, ...

  This is the classic repetition-loop / no-EOS-emission pathology on ultra-short utterances (single-word smart-speaker commands like "Off", "Up", "Play", "House", "Two", "Three"). The decoder runs all the way to max_new_tokens=256, and a 256-token hypothesis against a 1-word reference scores as roughly 1 substitution plus 255 insertions over 1 reference word, inflating that sample's WER by ~25,000 percentage points. Six such samples out of 500 are enough to drag corpus WER from ~20% (better than the 200k checkpoint!) to 56%.

  Why only English regressed: the indic-asr-benchmark-6k English subset contains these short command-style clips, while the Indic subsets are all full sentences, so the same decoding bug only manifests on the English split. It's not a data-mix issue and not a sampler problem.

  Real numbers with the 6 degenerate samples dropped:
  - Median WER 20% (vs 27.52% at 200k → English is actually still improving)
  - All 11 Indic languages at new bests
  - MER 11.34% (lowest yet)

  The model is healthier than ever; the benchmark just has a handful of adversarial inputs that trip greedy decoding.

  Fixes (pick one, all cheap):
  1. At benchmark_cohere.py:407, pass no_repeat_ngram_size=3 (or 4) to model.generate().
     This single argument kills the loop, and the "regression" will vanish from the next eval.
  2. Alternatively, add repetition_penalty=1.2.
  3. Or filter out clips shorter than 1.0 s before scoring (but #1 is the right fix; the model should handle them).
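For reference, fix #1 works because transformers' no_repeat_ngram_size bans any token that would complete an n-gram already present in the generated output. A toy pure-Python greedy decoder showing the mechanism (the token IDs, score table, and stuck "model" are invented for illustration; this is not the transformers implementation):

```python
EOS = 0
SCORES = {7: 1.0, 8: 0.5, 9: 0.2, EOS: 0.1}  # token 7 dominates -> repetition loop

def banned_next_tokens(out: list[int], n: int) -> set[int]:
    """Tokens that would complete an n-gram already present in `out`."""
    if n < 2 or len(out) < n:
        return set()
    prefix = tuple(out[-(n - 1):])           # the last n-1 generated tokens
    return {out[i + n - 1]                   # token that followed this prefix before
            for i in range(len(out) - n + 1)
            if tuple(out[i:i + n - 1]) == prefix}

def greedy(max_new_tokens: int, no_repeat_ngram_size: int = 0) -> list[int]:
    out: list[int] = []
    for _ in range(max_new_tokens):
        banned = banned_next_tokens(out, no_repeat_ngram_size)
        allowed = {t: s for t, s in SCORES.items() if t not in banned}
        tok = max(allowed, key=allowed.get)  # greedy pick among unbanned tokens
        if tok == EOS:
            break
        out.append(tok)
    return out

print(greedy(8))                             # → [7, 7, 7, 7, 7, 7, 7, 7]  (runaway loop)
print(greedy(8, no_repeat_ngram_size=2))     # → [7, 7, 8, 7, 9, 7]  (loop broken, EOS reached)
```

Note that in the blocked run the decoder not only breaks the loop but actually terminates early: once every looping continuation is banned, EOS becomes the best remaining option.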
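The mean/median/pooled gap in the table above is also easy to reproduce. A scaled-down toy, 10 samples instead of 500, using a plain-Python word-level Levenshtein rather than the benchmark's actual scorer:

```python
from statistics import mean, median

def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Word-level Levenshtein distance and reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution / match
        prev = cur
    return prev[-1], len(r)

# 9 ordinary samples at 20% WER plus one degenerate repetition loop.
pairs = [("hello world how are you", "hello world how is you")] * 9
pairs.append(("Off", " ".join(["two"] * 256)))          # 1-word ref vs 256-token loop

stats = [word_errors(r, h) for r, h in pairs]
per_sample = [d / n for d, n in stats]
pooled = sum(d for d, _ in stats) / sum(n for _, n in stats)

print(f"median {median(per_sample):.0%}  mean {mean(per_sample):.0%}  pooled {pooled:.0%}")
# → median 20%  mean 2578%  pooled 576%
```

One runaway sample out of ten is enough to blow up both the per-sample mean and the pooled corpus WER while the median stays put, which is exactly the signature in the table.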