## Observation

The `260k` English spike is **real in the artifact**, but it is **not a broad English collapse**. Two things stand out from `260k.json`:

1. The bad English is concentrated in a tiny set of ultra-short clips, not spread across the whole English set.
2. Those failures are **runaway repetition to `max_new_tokens`**, not prompt leakage or language confusion.

Examples:

```10643:10650:/workspace/maya-asr/260k.json
"id": "en_0064",
"language": "english",
"lang_code": "en",
"source": "Svarah",
"duration": 0.2343125,
"reference": "Up",
"hypothesis": "okay, okay, okay, okay, okay, okay, okay, okay, ...",
"detected_language": "English"
```

```13023:13030:/workspace/maya-asr/260k.json
"id": "en_0302",
"language": "english",
"lang_code": "en",
"source": "Svarah",
"duration": 0.297,
"reference": "Up",
"hypothesis": "I am am am am am am am am am am am am ...",
"detected_language": "English"
```

What I measured from the `260k` artifact:

- All `500/500` English outputs stay in English script.
- `0` English outputs contain prompt junk like `<|...|>` or byte junk like `<0x..>`.
- `37` English benchmark clips are `<0.5s`.
- Your training cleanup explicitly removed `<0.5s` utterances, so these clips are out-of-distribution.
- Just `8/500` English samples account for about **59% of all English errors**.
- English full WER is `56.31`, but on English clips `>=0.5s` it is about **`22.61`**.
- Overall full WER is `34.04`, but without all `<0.5s` clips it is about **`31.38`**.

So the headline is:

- **Indic gains look real.**
- **The scary English number is heavily exaggerated by short-clip degeneration.**
- There may still be some English retention pressure issue, but `56.31` is overstating it a lot.

## Benchmark Path

Your benchmark script is also not using the same guarded post-processing path as the model’s batched `transcribe` flow.
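Concretely, the guard in question is prompt-prefix stripping plus EOS truncation. A minimal standalone sketch of that trimming logic (the name `trim_generated` and its signature are illustrative, not the repo's actual API):

```python
def trim_generated(token_ids, prompt_ids, eos_token_id=None):
    """Guarded post-processing sketch: drop an echoed prompt prefix,
    then cut everything from the first EOS token onward.

    Illustrative helper -- not the actual code in modeling_cohere_asr.py.
    """
    token_ids = list(token_ids)
    prompt_ids = list(prompt_ids)
    # Strip the prompt only if the output actually starts with it.
    if token_ids[: len(prompt_ids)] == prompt_ids:
        token_ids = token_ids[len(prompt_ids):]
    # Truncate at the first EOS, if one was emitted.
    if eos_token_id is not None and eos_token_id in token_ids:
        token_ids = token_ids[: token_ids.index(eos_token_id)]
    return token_ids
```

Without this kind of guard, an evaluator can decode the echoed prompt and anything generated past EOS straight into the hypothesis string, which is exactly the mismatch below.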
Current benchmark path:

```399:410:/workspace/maya-asr/benchmark_cohere.py
with torch.no_grad():
    try:
        outputs = model.generate(
            input_features=input_features,
            length=length,
            decoder_input_ids=decoder_input_ids,
            max_new_tokens=max_new_tokens,
        )

        for i, meta in enumerate(batch_meta):
            hyp = decode_tokens(tokenizer, outputs[i]).strip()
            detected = detect_output_language(hyp)
```

Model’s safer batched path:

```1202:1270:/workspace/training/tokenizer_extension/extended_model/modeling_cohere_asr.py
inputs = processor(audio=batch_waves, text=prompts, sampling_rate=batch_srs[0], return_tensors="pt")
inputs = {k: v.to(self.device) for k, v in inputs.items()}
if "input_ids" in inputs and "decoder_input_ids" not in inputs:
    inputs["decoder_input_ids"] = inputs.pop("input_ids")
if "decoder_input_ids" in inputs and "decoder_attention_mask" not in inputs:
    ...
generated_ids = self.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    do_sample=False,
    num_beams=1,
    decoder_start_token_id=int(inputs["decoder_input_ids"][0, 0].item()),
    use_cache=True,
)
...
if starts_with_prompt:
    token_ids = token_ids[prompt_len:]
if eos_token_id is not None:
    try:
        token_ids = token_ids[: token_ids.index(eos_token_id)]
```

I would be careful here: this benchmark mismatch is **not** inventing the repeated English outputs from nothing. Those loops are in the actual decoded artifact. But it does mean I would not treat extreme edge-case numbers as fully trustworthy until the benchmark matches serving behavior more closely.

## Best Next Course

1. **Do not interpret `56.31` as “English broadly collapsed.”** The stronger interpretation is: `260k` has a **short-utterance degeneration problem**, and because English has many tiny command-like clips, corpus WER gets nuked.

2. **Patch `benchmark_cohere.py` before making training decisions from this table.** I’d make it:
   - follow the same prompt/EOS trimming logic as `transcribe`
   - report `full` and `>=0.5s` metrics separately
   - report a `degenerate_decode_count`
   - report how many samples hit `max_new_tokens`

3. **Treat this as two separate problems, not one.**
   - Problem A: short-clip decode instability on OOD micro-utterances
   - Problem B: likely weaker English retention because your sampler strongly flattens the language mix

4. **Do not restart the run because of this benchmark alone.** The Indic improvements are too consistent for that. I’d only call this a true English-collapse training problem if English `>=0.5s` also degrades after re-benchmarking with the patched evaluator.

5. **For the next training phase, tune retention, not the whole pipeline.**
   - lower sampling temperature from `5.0` to something like `2` or `3`
   - give English a floor in sampling
   - resume with a lower LR than `2e-4`

If you want, I can patch `benchmark_cohere.py` now so it reports the right slices and makes this `200k vs 260k` comparison trustworthy.
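The sliced reporting from point 2 can be prototyped directly against the artifact, before touching the benchmark script. A minimal sketch, assuming the artifact is a flat list of records with `duration`, `reference`, and `hypothesis` fields as in the examples above; `looks_degenerate` and its thresholds are my own illustrative heuristic, not anything already in the repo:

```python
import json

def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1]

def corpus_wer(samples):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errs = sum(word_errors(s["reference"], s["hypothesis"]) for s in samples)
    words = sum(len(s["reference"].split()) for s in samples)
    return 100.0 * errs / max(words, 1)

def looks_degenerate(hyp, max_repeat_ratio=0.5, min_words=8):
    """Flag runaway repetition: one word dominating a long hypothesis.
    Thresholds are illustrative guesses, not tuned values."""
    words = hyp.lower().split()
    if len(words) < min_words:
        return False
    top = max(words.count(w) for w in set(words))
    return top / len(words) >= max_repeat_ratio

def sliced_report(samples, min_dur=0.5):
    """Report full vs >=0.5s WER plus a degenerate-decode count."""
    ge = [s for s in samples if s["duration"] >= min_dur]
    return {
        "full_wer": corpus_wer(samples),
        "ge_0.5s_wer": corpus_wer(ge) if ge else None,
        "degenerate_decode_count": sum(
            looks_degenerate(s["hypothesis"]) for s in samples),
    }

# Hypothetical usage, if 260k.json is a flat list of such records:
# report = sliced_report(json.load(open("260k.json")))
```

If the full-slice number collapses toward the `>=0.5s` number once degenerate decodes are excluded, that confirms the short-clip story rather than a broad English regression.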