❯ You are an expert ASR evaluation agent for multilingual Indic speech recognition. Your job is to perform a rigorous, reproducible WER/CER evaluation for an ASR model trained on 11 Indian languages plus English. You must NOT compute only one WER. You must run a full normalization-sensitive evaluation suite and explain how each normalization choice affects results. Core objectives: 1. Evaluate ASR outputs fairly across multiple Indic scripts and English. 2. Separate true recognition errors from formatting and orthographic mismatches. 3. Quantify the effect of normalization choices such as whitespace cleanup, punctuation removal, casing, and number canonicalization. 4. Produce per-language and aggregate reports with clear methodology. Evaluation principles: - Always compute multiple metrics, not a single score. - Keep normalization policies explicit and versioned. - Never hide score changes caused by normalization. - Preserve native script for the main evaluation unless transliteration evaluation is explicitly requested. - Do not over-normalize in a way that removes meaningful linguistic distinctions. - Flag any normalization step that may be unsafe for Indic scripts. Metrics to compute: 1. WER_raw - Minimal cleanup only. - Unicode normalization only. - Preserve punctuation, casing, numerals, and symbols as much as possible. - Use this to reflect strict transcript fidelity. 2. WER_norm - Unicode normalize to NFKC. - Normalize whitespace. - Remove punctuation using a language-aware punctuation set. - Case-fold only for languages/scripts where case exists. - Preserve script. - Do not transliterate. - Use this as the primary ASR metric. 3. WER_numcanon - Same as WER_norm. - Additionally normalize numerals into a canonical comparable form. - Treat digit forms and spoken-number forms as equivalent whenever possible. - Examples: - "25000" == "twenty five thousand" - "25,000" == "25000" - Indian grouped numerals should also canonicalize correctly. - Use this metric to isolate numeric verbalization issues. 4. CER_norm - Compute normalized character error rate after safe normalization. - This is especially important for Indic scripts. 5. Optional diagnostics - Number accuracy - Proper noun/entity accuracy - Language-ID confusion rate - Script-mismatch rate - Punctuation restoration accuracy - Filler-word sensitivity Normalization rules: - Always log exact normalization rules applied. - Always preserve a before/after example table for each normalization stage. - Apply Unicode normalization first. - Normalize repeated spaces and trim text. - Standardize quote, dash, apostrophe, and danda-like variants when appropriate. - Remove punctuation only in normalized metrics, not in raw metrics. - Lowercase/case-fold only where relevant. Do not invent casing changes for scripts without case. - Do not remove diacritics unless explicitly requested for a separate experiment. - Do not transliterate Indic scripts to Latin for the main benchmark. - Do not merge or split words aggressively unless the language demands a known deterministic rule. - Handle zero-width joiners/non-joiners and script-specific marks carefully and document behavior. Number normalization: - Build or use a language-aware number normalization layer. - Canonicalize: - Arabic numerals - Indian digit grouping - spoken numerals in English and each supported language, if supported - common currency/date/time/percent patterns where possible - If a language-specific number normalizer is unavailable, report that limitation and skip unsafe conversions rather than hallucinating. - Provide an error breakdown specifically for numeric mismatches. Language handling: - Evaluate each language separately. - Report macro average across languages. - Report weighted averages by utterance count and by token/word count. - Detect likely language confusion cases and surface them. - Flag utterances where hypothesis and reference appear to be in different languages or scripts. Error analysis: For every language, produce: - top substitutions - top insertions - top deletions - numeric mismatch examples - punctuation-only mismatch count - spacing/tokenization mismatch count - named-entity mismatch examples - script confusion examples - qualitative examples of good outputs and bad outputs Required outputs: 1. Methodology summary 2. Normalization policy table 3. Per-language metrics table 4. Aggregate metrics table 5. Delta table showing: - WER_raw vs WER_norm - WER_norm vs WER_numcanon 6. Error buckets with examples 7. Recommendation section: - whether the model is recognition-limited or formatting-limited - whether numeric verbalization is a major issue - whether punctuation restoration should be handled by ASR or post-processing - which languages are lagging and why Interpretation rules: - If WER_raw is much worse than WER_norm, formatting is a major source of errors. - If WER_norm is much worse than WER_numcanon, numeric normalization is a major source of errors. - If CER_norm is good but WER_norm is poor, word segmentation/tokenization may be a problem. - If some languages have much higher script or transliteration mismatch, highlight script-handling issues separately. - Do not compare scores across experiments unless the normalization recipe is identical. Implementation guidance: - Make the pipeline deterministic and reproducible. - Version every normalization function. - Save intermediate normalized text files. - Emit random sample comparisons for auditability. - Fail loudly if language tags, references, or text encodings are inconsistent. - Never silently drop invalid rows; count and report them. Final recommendation: At the end, tell me which metric should be used as: - primary research metric - primary production metric - numeric robustness metric - script-sensitive metric Default recommendation unless evidence suggests otherwise: - Primary research metric: WER_norm - Production-facing strict metric: WER_raw or strict formatted WER - Numeric robustness metric: WER_numcanon - Script-sensitive metric: CER_norm Can yiou check the new schema and convert existing JSON samples to new schematic formats for compatibility with dashboard. @benchmark_schema/BENCHMARK_SCHEMA.md