# Qwen3-ASR Indic Benchmark Analysis Report
## ckpt-24000 vs ckpt-72000 on indic-asr-benchmark-6k

### Metrics Explanation

| Metric | Description | Industry Standard? |
|--------|-------------|--------------------|
| **WER** (Word Error Rate) | Primary ASR metric. Measures (insertions + deletions + substitutions) / reference words. | Yes - universal standard |
| **CER** (Character Error Rate) | Same as WER but at character level. Critical for morphologically rich / agglutinative languages (Tamil, Malayalam, Kannada). | Yes - especially for CJK, Indic |
| **WER-normalized** | WER after text normalization: NFC unicode, lowercase, punctuation removal, whitespace collapse. Removes surface-level mismatches. | Yes - all major benchmarks (OpenASR, CommonVoice) normalize |
| **CER-normalized** | Same normalization applied to CER. | Yes |

*Note: "Space-normalized WER" isn't a formal separate metric - it's part of standard normalization. Our WER-normalized covers this.*
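The metric definitions above can be sketched in a few lines. This is an illustrative implementation, not the exact scoring pipeline used for this report; the normalization steps mirror the table (NFC, lowercase, punctuation removal, whitespace collapse).

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Normalization as described above: NFC, lowercase,
    punctuation removal, whitespace collapse."""
    text = unicodedata.normalize("NFC", text).lower()
    # Drop punctuation (Unicode category P*) so surface mismatches don't count.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()

def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance = minimum insertions + deletions + substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: edit distance over reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: same computation at character level."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

The "-normalized" variants are simply `wer(normalize(ref), normalize(hyp))` and likewise for CER.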

---

### Results Summary

| Language    | WER 24k | WER 72k | Delta  | Rel. Improv. | CER 24k | CER 72k | Delta   |
|-------------|---------|---------|--------|--------------|---------|---------|---------|
| assamese    | 58.52   | 57.44   | -1.08  | +1.8%        | 29.59   | 29.44   | -0.15   |
| bengali     | 43.59   | 35.02   | -8.57  | **+19.7%**   | 21.97   | 13.77   | -8.20   |
| english     | 27.66   | 26.10   | -1.56  | +5.6%        | 11.77   | 11.31   | -0.46   |
| gujarati    | 31.12   | 28.18   | -2.94  | +9.4%        | 12.15   | 10.24   | -1.91   |
| hindi       | 14.07   | 14.04   | -0.03  | +0.2%        | 5.17    | 5.03    | -0.14   |
| kannada     | 55.09   | 48.83   | -6.26  | +11.4%       | 20.77   | 15.71   | -5.06   |
| malayalam   | 60.10   | 55.72   | -4.38  | +7.3%        | 17.31   | 15.46   | -1.85   |
| **marathi** | 62.82   | 43.35   | **-19.47** | **+31.0%** | 35.92 | 17.33   | **-18.59** |
| odia        | 48.37   | 44.75   | -3.62  | +7.5%        | 19.28   | 17.23   | -2.05   |
| **punjabi** | 32.41   | 35.01   | **+2.60** | **-8.0%** | 14.89   | 17.57   | +2.68   |
| tamil       | 62.92   | 53.33   | -9.59  | +15.2%       | 28.07   | 16.94   | -11.13  |
| telugu      | 46.11   | 41.89   | -4.22  | +9.2%        | 12.38   | 10.62   | -1.76   |
| **OVERALL** | 45.38   | 40.58   | -4.80  | +10.6%       | 19.83   | 15.45   | -4.38   |
| **MACRO**   | 45.23   | 40.30   | -4.93  | +10.9%       | 19.11   | 15.05   | -4.06   |
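The OVERALL row pools error and reference-word counts across all utterances (so languages with more audio weigh more), while MACRO is the unweighted mean of the per-language scores. A minimal sketch of the distinction, using the ckpt-72000 WER column from the table above:

```python
# Per-language WER at ckpt-72000, copied from the table above.
wer_72k = {
    "assamese": 57.44, "bengali": 35.02, "english": 26.10, "gujarati": 28.18,
    "hindi": 14.04, "kannada": 48.83, "malayalam": 55.72, "marathi": 43.35,
    "odia": 44.75, "punjabi": 35.01, "tamil": 53.33, "telugu": 41.89,
}

# MACRO: unweighted mean over languages - every language counts equally.
macro = sum(wer_72k.values()) / len(wer_72k)  # ~40.30, matching the table

# OVERALL (micro) instead pools raw error and reference-word counts before
# dividing; it cannot be recovered from per-language percentages alone, so
# it is shown only schematically here.
def micro_wer(errors_per_lang: dict, ref_words_per_lang: dict) -> float:
    return 100 * sum(errors_per_lang.values()) / sum(ref_words_per_lang.values())

# Relative improvement, as used in the "Rel. Improv." column:
rel_improv = (45.38 - 40.58) / 45.38  # ~0.106 -> +10.6%
```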

---

### Key Findings

#### 1. Strong overall progress: 48k more steps = ~11% relative WER improvement
- **Overall WER**: 45.38% -> 40.58% (-4.80 absolute, +10.6% relative)
- **Overall CER**: 19.83% -> 15.45% (-4.38 absolute, +22.1% relative)
- The model is clearly learning and improving across the board.

#### 2. Marathi: The biggest winner (+31% relative improvement)
- WER dropped from 62.82% to 43.35% (-19.47 absolute)
- CER dropped from 35.92% to 17.33% (-18.59 absolute)
- **Root cause**: At ckpt-24000, the model frequently **misidentified Marathi as Gujarati** (Marathi is written in Devanagari; the Gujarati script is closely related and visually similar). By ckpt-72000, language identification is fixed - the model correctly detects Marathi and transcribes it in Devanagari.
- This is a language-identification fix, not just a transcription-accuracy improvement.

#### 3. Punjabi: The only regression (-8% relative)
- WER went UP from 32.41% to 35.01% (+2.60 absolute)
- CER went UP from 14.89% to 17.57% (+2.68 absolute)
- Sample analysis shows the regression is subtle: both checkpoints handle common words well, but ckpt-72000 makes slightly more errors on proper nouns and place names. Possible causes include catastrophic forgetting or a shift in the training-data distribution between 24k and 72k steps.
- **Action needed**: Check if Punjabi training data proportion dropped, or if the learning rate schedule is causing instability for this language.

#### 4. Language tiers by difficulty

**Tier 1 - Strong (<20% WER-norm):**
- Hindi: 10.04% (already near-converged at 24k, barely improved)
- English: 14.66% (good baseline, slight regression on normalized WER)

**Tier 2 - Good (20-30% WER-norm):**
- Gujarati: 22.86%
- Punjabi: 28.46%

**Tier 3 - Moderate (30-40% WER-norm):**
- Bengali: 30.42% (big improvement from 39.21%)
- Telugu: 34.14%
- Marathi: 37.86% (massive jump from 58.72%)
- Odia: 38.69%

**Tier 4 - Needs work (>40% WER-norm):**
- Kannada: 42.13%
- Tamil: 47.82%
- Malayalam: 49.65%
- Assamese: 53.38%
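The tier cut-offs above reduce to a simple threshold rule. A sketch (the boundaries are this report's grouping, not a standard):

```python
def tier(wer_norm: float) -> str:
    """Bucket a language by normalized WER using the cut-offs above."""
    if wer_norm < 20:
        return "Tier 1 - Strong"
    if wer_norm < 30:
        return "Tier 2 - Good"
    if wer_norm < 40:
        return "Tier 3 - Moderate"
    return "Tier 4 - Needs work"
```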

#### 5. Tamil: interesting progress pattern
- Large WER improvement (-9.59) but the model still hallucinates in wrong scripts sometimes.
- At 24k, one Tamil sample was transcribed entirely in **Malayalam script**; by 72k, the same sample is correctly rendered in Tamil script.
- Shows the model is still resolving script/language boundaries for Dravidian languages.

#### 6. Hindi & English: already plateaued
- Hindi barely moved (14.07 -> 14.04 WER), suggesting the base Qwen3 model already had strong Hindi capability from pretraining.
- English similarly stable. Further gains for these languages would likely need domain-specific data, not more steps.

---

### Recommendations

1. **Continue training** - 11% relative improvement in 48k steps is solid progress. Most languages are still improving.
2. **Investigate Punjabi regression** - Could be data distribution, LR schedule, or catastrophic forgetting. Consider adding more Punjabi data.
3. **Dravidian languages need more data** - Tamil/Malayalam/Kannada are the weakest. Their agglutinative morphology makes them harder; more training data + CER-focused loss weighting could help.
4. **Assamese plateau** - Only 1.8% relative improvement suggests the model is struggling. Assamese has the least training data (sourced from IndicVoices, whereas most other languages draw on Kathbath); data quantity may be the bottleneck.
5. **Next checkpoint eval** - Run again at ~120k-150k steps to see if Punjabi regression persists and whether Tier 3/4 languages continue improving.

---

### RTF (Real-Time Factor)

| Checkpoint | Inference Time | Audio Duration | RTF |
|------------|---------------|----------------|-----|
| ckpt-24000 | 2472.6s | 40354s | 0.0613 |
| ckpt-72000 | ~2500s | 40354s | ~0.062 |

Both checkpoints run at ~16x real-time on a single GPU with batch_size=16, which is good for evaluation throughput.
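RTF is simply inference wall-clock time divided by total audio duration (lower is faster); a quick check of the table values (the 72k time is approximate, as noted):

```python
# RTF = processing time / audio duration; RTF < 1 means faster than real time.
inference_s = 2472.6   # ckpt-24000 wall-clock inference time (s)
audio_s = 40354.0      # total audio duration of the 6k benchmark (s)

rtf = inference_s / audio_s   # ~0.0613
speedup = 1 / rtf             # ~16.3x real-time
```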