Now I have the full picture. Here are my suggestions for your inference agent:

---

**Key issues in the benchmark script:**

**1. `validate_language()` will crash on Indic languages**

The biggest problem: `utils.py:SUPPORTED_LANGUAGES` only has the 30 base Qwen3-ASR languages. It does NOT include Assamese, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, or Telugu. When the benchmark calls `model.transcribe(language="Tamil")`, it hits `validate_language()`, which raises `ValueError: Unsupported language: Tamil`.

Fix: the inference agent needs to either:
- add the Indic languages to `SUPPORTED_LANGUAGES` in `qwen_asr/inference/utils.py`, or
- pass `language=None` and let the model auto-detect (the model was trained with `language_tag_mode=auto`, which embeds language names in the target text, so auto-detect should work).

**2. Language name mismatch**

The benchmark script's `LANG_CODE_TO_NAME` matches the training script's exactly; that's good. But `_MODEL_SUPPORTED_LANGS` in the benchmark only has the base 30 languages, so the `force_language` logic falls back to `None` for every Indic language except Hindi and English. This means:

- Hindi/English: forced language, so the model outputs the raw transcription text only
- Other Indic languages: auto-detect, so the model outputs `language Tamil` glued directly onto the actual text

This inconsistency will distort WER, because `parse_asr_output` strips the language prefix when `user_language` is set but keeps it otherwise.

Fix: either force all languages (after adding them to `SUPPORTED_LANGUAGES`) or force none.

**3. How the model actually generates output**

During training, the target is the `language <Name>` tag glued directly onto the transcript, e.g. `language Tamilநான் வீட்டுக்கு போகிறேன்`. At inference with `force_language="Tamil"`, the prompt is pre-filled with `language Tamil` and the model generates just the transcription text. Without forcing, the model generates the full `language X` + text sequence and `parse_asr_output` extracts the text part.

**4. The `transcribe()` API expects `(np.ndarray, sr)` tuples, not raw arrays**

The benchmark passes `audio_inputs.append((wav, sr))`; that's correct.

**5. Recommendation for the benchmark**

```python
# Option A: Add the Indic languages to SUPPORTED_LANGUAGES and force language.
# This is cleaner and gives better results, since the model doesn't waste
# tokens on language detection.
# In qwen_asr/inference/utils.py, add to SUPPORTED_LANGUAGES:
#   "Assamese", "Bengali", "Gujarati", "Kannada", "Malayalam",
#   "Marathi", "Odia", "Punjabi", "Tamil", "Telugu"

# Option B: Don't force language; let auto-detect work.
# Simpler, but the model generates the `language X` + text sequence,
# and parse_asr_output handles extraction.
# Use:
model.transcribe(audio=audio_inputs, language=None)
```

**6. `max_new_tokens=256` may be too low** for longer Indic transcriptions: some of the 10-15 s samples in the training data have 280+ character transcripts. Consider 448 or 512.

**7. The checkpoint needs the base model files**

The `MakeEveryCheckpointInferableCallback` copies the required HF files (`preprocessor_config`, `chat_template`, etc.) into each checkpoint. Verify these exist in the checkpoint being benchmarked; otherwise `from_pretrained` will fail.
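The language-list fix and the consistent-forcing fix go together, and can be sketched as below. `BASE_SUPPORTED_LANGUAGES` and `choose_forced_language` are illustrative stand-ins for the list in `qwen_asr/inference/utils.py` and the benchmark's fallback logic, not the repo's actual code, and the base-30 list is abbreviated:

```python
# Hypothetical sketch of Option A: extend the supported-language set so the
# force-language logic never falls back to auto-detect for Indic languages.
# BASE_SUPPORTED_LANGUAGES stands in for the real base-30 list (abbreviated).
BASE_SUPPORTED_LANGUAGES = ["English", "Hindi"]  # other base-30 names elided

INDIC_LANGUAGES = [
    "Assamese", "Bengali", "Gujarati", "Kannada", "Malayalam",
    "Marathi", "Odia", "Punjabi", "Tamil", "Telugu",
]

SUPPORTED_LANGUAGES = BASE_SUPPORTED_LANGUAGES + INDIC_LANGUAGES


def choose_forced_language(lang_name):
    """Force the language when supported; otherwise fall back to auto-detect (None)."""
    return lang_name if lang_name in SUPPORTED_LANGUAGES else None
```

With the extended list, `choose_forced_language("Tamil")` returns `"Tamil"` instead of `None`, so every benchmarked language goes through the same forced-prefix code path and the prefix stripping behaves uniformly.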
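The output-format behaviour described above (forced vs. auto-detected language) can be illustrated with a minimal sketch. `extract_text` is a hypothetical reimplementation of the extraction step, not the repo's actual `parse_asr_output`; it assumes the `language <Name>` tag is glued directly onto the transcript and that the possible language names are known:

```python
from typing import Optional

# Hypothetical subset of language names the tag may contain, for illustration.
KNOWN_LANGUAGES = ("Tamil", "Telugu", "Hindi", "English", "Bengali")


def extract_text(raw_output: str, user_language: Optional[str]) -> str:
    """Sketch of the extraction step: drop a leading 'language <Name>' tag.

    When user_language was forced, the prompt was pre-filled with the tag,
    so raw_output is already plain transcript text. Otherwise the model
    emitted 'language <Name>' glued directly onto the transcript.
    """
    if user_language is not None:
        return raw_output  # forced language: model generated transcript only
    prefix = "language "
    if raw_output.startswith(prefix):
        rest = raw_output[len(prefix):]
        for name in KNOWN_LANGUAGES:
            if rest.startswith(name):
                return rest[len(name):]
    return raw_output  # no recognizable tag: return unchanged
```

This is why mixing forced and unforced languages in one benchmark run skews WER: the two code paths return differently shaped strings unless the stripping is applied consistently.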
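For sizing `max_new_tokens`, a rough budget can be derived from the longest expected transcript. The tokens-per-character ratio below is an assumption (BPE tokenizers are often token-hungry on Indic scripts), not a measurement; check it against your actual tokenizer before relying on it:

```python
import math


def generation_budget(max_chars, tokens_per_char=1.5, safety_margin=1.2):
    """Rough max_new_tokens estimate; both ratios are assumptions, not measurements."""
    return math.ceil(max_chars * tokens_per_char * safety_margin)
```

For 280-character transcripts this lands around the 500-token mark, consistent with bumping `max_new_tokens` from 256 to 448 or 512.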
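The missing-checkpoint-files failure mode can be guarded with a quick preflight check before calling `from_pretrained`. The entries in `REQUIRED_FILES` are assumptions based on the files mentioned above; adjust them to whatever `MakeEveryCheckpointInferableCallback` actually copies:

```python
from pathlib import Path

# Assumed file names (hypothetical): match what the callback really copies.
REQUIRED_FILES = ["preprocessor_config.json", "chat_template.json"]


def missing_checkpoint_files(checkpoint_dir):
    """Return the required files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Run it on the checkpoint directory before benchmarking; an empty list means the preprocessing files `from_pretrained` depends on are at least present.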