Now I have the full picture. Here are my suggestions for your inference agent:

---

**Key issues in the benchmark script:**

**1. `validate_language()` will crash on Indic languages**

The biggest problem: `utils.py:SUPPORTED_LANGUAGES` only has the 30 base Qwen3-ASR languages. It does NOT include Assamese, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, or Telugu. When the benchmark calls `model.transcribe(language="Tamil")`, it hits `validate_language()`, which raises `ValueError: Unsupported language: Tamil`.

Fix: the inference agent needs to either:
- add the Indic languages to `SUPPORTED_LANGUAGES` in `qwen_asr/inference/utils.py`, or
- pass `language=None` and let the model auto-detect (the model was trained with `language_tag_mode=auto`, which embeds language names in the target text, so auto-detect should work).

**2. Language name mismatch**

The benchmark script's `LANG_CODE_TO_NAME` matches the training script's exactly; that's good. But `_MODEL_SUPPORTED_LANGS` in the benchmark only has the base 30 languages, so the `force_language` logic falls back to `None` for every Indic language except Hindi and English. This means:

- Hindi/English: forced language, so the model outputs the raw transcription text only
- Other Indic languages: auto-detect, so the model outputs `language Tamil` glued directly onto the actual text

This inconsistency will distort WER, because `parse_asr_output` strips the language prefix when `user_language` is set but keeps it otherwise.

Fix: either force all languages (after adding them to `SUPPORTED_LANGUAGES`) or force none.

**3. How the model actually generates output**

During training, the target is the `language <Name>` tag glued directly onto the transcript, e.g. `language Tamilநான் வீட்டுக்கு போகிறேன்`. At inference with `force_language="Tamil"`, the prompt is pre-filled with `language Tamil` and the model generates just the transcription text. Without forcing, the model generates the full `language X` + text sequence and `parse_asr_output` extracts the text part.

**4. The `transcribe()` API expects `(np.ndarray, sr)` tuples, not raw arrays**

The benchmark passes `audio_inputs.append((wav, sr))`; that's correct.

**5. Recommendation for the benchmark**

```python
# Option A: Add the Indic languages to SUPPORTED_LANGUAGES and force language.
# This is cleaner and gives better results, since the model doesn't waste
# tokens on language detection.
# In qwen_asr/inference/utils.py, add to SUPPORTED_LANGUAGES:
#   "Assamese", "Bengali", "Gujarati", "Kannada", "Malayalam",
#   "Marathi", "Odia", "Punjabi", "Tamil", "Telugu"

# Option B: Don't force language; let auto-detect work.
# Simpler, but the model generates the `language X` + text sequence,
# and parse_asr_output handles extraction.
# Use:
model.transcribe(audio=audio_inputs, language=None)
```

**6. `max_new_tokens=256` may be too low** for longer Indic transcriptions: some of the 10-15 s samples in the training data have 280+ character transcripts. Consider 448 or 512.

**7. The checkpoint needs the base model files**

The `MakeEveryCheckpointInferableCallback` copies the required HF files (`preprocessor_config`, `chat_template`, etc.) into each checkpoint. Verify these exist in the checkpoint being benchmarked; otherwise `from_pretrained` will fail.
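The language-list fix and the consistent-forcing fix go together, and can be sketched as below. `BASE_SUPPORTED_LANGUAGES` and `choose_forced_language` are illustrative stand-ins for the list in `qwen_asr/inference/utils.py` and the benchmark's fallback logic, not the repo's actual code, and the base-30 list is abbreviated:

```python
# Hypothetical sketch of Option A: extend the supported-language set so the
# force-language logic never falls back to auto-detect for Indic languages.
# BASE_SUPPORTED_LANGUAGES stands in for the real base-30 list (abbreviated).
BASE_SUPPORTED_LANGUAGES = ["English", "Hindi"]  # other base-30 names elided

INDIC_LANGUAGES = [
    "Assamese", "Bengali", "Gujarati", "Kannada", "Malayalam",
    "Marathi", "Odia", "Punjabi", "Tamil", "Telugu",
]

SUPPORTED_LANGUAGES = BASE_SUPPORTED_LANGUAGES + INDIC_LANGUAGES


def choose_forced_language(lang_name):
    """Force the language when supported; otherwise fall back to auto-detect (None)."""
    return lang_name if lang_name in SUPPORTED_LANGUAGES else None
```

With the extended list, `choose_forced_language("Tamil")` returns `"Tamil"` instead of `None`, so every benchmarked language goes through the same forced-prefix code path and the prefix stripping behaves uniformly.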
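The output-format behaviour described above (forced vs. auto-detected language) can be illustrated with a minimal sketch. `extract_text` is a hypothetical reimplementation of the extraction step, not the repo's actual `parse_asr_output`; it assumes the `language <Name>` tag is glued directly onto the transcript and that the possible language names are known:

```python
from typing import Optional

# Hypothetical subset of language names the tag may contain, for illustration.
KNOWN_LANGUAGES = ("Tamil", "Telugu", "Hindi", "English", "Bengali")


def extract_text(raw_output: str, user_language: Optional[str]) -> str:
    """Sketch of the extraction step: drop a leading 'language <Name>' tag.

    When user_language was forced, the prompt was pre-filled with the tag,
    so raw_output is already plain transcript text. Otherwise the model
    emitted 'language <Name>' glued directly onto the transcript.
    """
    if user_language is not None:
        return raw_output  # forced language: model generated transcript only
    prefix = "language "
    if raw_output.startswith(prefix):
        rest = raw_output[len(prefix):]
        for name in KNOWN_LANGUAGES:
            if rest.startswith(name):
                return rest[len(name):]
    return raw_output  # no recognizable tag: return unchanged
```

This is why mixing forced and unforced languages in one benchmark run skews WER: the two code paths return differently shaped strings unless the stripping is applied consistently.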
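For sizing `max_new_tokens`, a rough budget can be derived from the longest expected transcript. The tokens-per-character ratio below is an assumption (BPE tokenizers are often token-hungry on Indic scripts), not a measurement; check it against your actual tokenizer before relying on it:

```python
import math


def generation_budget(max_chars, tokens_per_char=1.5, safety_margin=1.2):
    """Rough max_new_tokens estimate; both ratios are assumptions, not measurements."""
    return math.ceil(max_chars * tokens_per_char * safety_margin)
```

For 280-character transcripts this lands around the 500-token mark, consistent with bumping `max_new_tokens` from 256 to 448 or 512.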
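The missing-checkpoint-files failure mode can be guarded with a quick preflight check before calling `from_pretrained`. The entries in `REQUIRED_FILES` are assumptions based on the files mentioned above; adjust them to whatever `MakeEveryCheckpointInferableCallback` actually copies:

```python
from pathlib import Path

# Assumed file names (hypothetical): match what the callback really copies.
REQUIRED_FILES = ["preprocessor_config.json", "chat_template.json"]


def missing_checkpoint_files(checkpoint_dir):
    """Return the required files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Run it on the checkpoint directory before benchmarking; an empty list means the preprocessing files `from_pretrained` depends on are at least present.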