# ASR Benchmark Schema

Canonical JSON schema for all ASR benchmark outputs. Every agent MUST conform to this schema.

---

## Directory Layout

```
/home/ubuntu/training/benchmark_outputs/
  <model_id>/
    <checkpoint_name>/
      metrics.json
      sample_analysis.json
      error_analysis.json
```

## Model Registry

| Model ID         | Display Name                    | Checkpoints              |
|------------------|---------------------------------|--------------------------|
| `qwen3-asr`     | Qwen3-ASR-1.7B                 | ckpt-24000, 72000, 100000|
| `gemma3n-e2b`   | Gemma-3n-E2B-ASR               | ckpt-10000, 20000        |
| `parakeet-1.1b` | Parakeet-1.1B-Language-Guided  | baseline (placeholder)   |

## Languages

| Language   | Code | Source                  | Samples |
|------------|------|-------------------------|---------|
| assamese   | as   | IndicVoices (valid)     | 500     |
| bengali    | bn   | Kathbath (test_known)   | 500     |
| english    | en   | Svarah                  | 500     |
| gujarati   | gu   | Kathbath (test_known)   | 500     |
| hindi      | hi   | Kathbath (test_known)   | 500     |
| kannada    | kn   | Kathbath (test_known)   | 500     |
| malayalam  | ml   | Kathbath (test_known)   | 500     |
| marathi    | mr   | Kathbath (test_known)   | 500     |
| odia       | or   | Kathbath (test_known)   | 500     |
| punjabi    | pa   | Kathbath (test_known)   | 500     |
| tamil      | ta   | Kathbath (test_known)   | 500     |
| telugu     | te   | Kathbath (test_known)   | 500     |

---

## Metric Tiers

Six metric tiers are computed. Beyond `wer_raw`, each tier builds on the standard normalization pipeline with one additional step.

| Metric           | Unicode | Whitespace     | Punctuation Removal | Case Fold | Number Canon | Purpose                              |
|------------------|---------|----------------|---------------------|-----------|--------------|--------------------------------------|
| `wer_raw`        | NFC     | trim only      | No                  | No        | No           | Strict transcript fidelity           |
| `wer_norm`       | NFKC    | normalize      | Yes (lang-aware)    | Yes*      | No           | **Primary ASR metric**               |
| `wer_numcanon`   | NFKC    | normalize      | Yes (lang-aware)    | Yes*      | Yes          | Numeric robustness                   |
| `space_norm_wer` | NFKC    | **removed for alignment** | Yes (lang-aware) | Yes* | No   | Word-level after space-insensitive alignment |
| `mer`            | NFKC    | **all removed**| Yes (lang-aware)    | Yes*      | No           | Meaningful Error Rate (char-level)   |
| `cer_norm`       | NFKC    | normalize      | Yes (lang-aware)    | Yes*      | No           | Script-sensitive / Indic             |

*Case fold only for scripts with case (Latin). Indic scripts are unaffected.

### MER — Meaningful Error Rate

MER measures whether the model got the **content** right regardless of word boundary placement. It answers: "Did the model produce the right characters in the right order, even if it split or merged words incorrectly?"

**How it works**: Remove ALL spaces from both reference and hypothesis after standard normalization, then compute character-level edit distance on the resulting strings (equivalently, CER on the space-stripped text). This means:

```
"india"     == "indi a"      ✓ (split word — not penalized)
"good"      == "go od"       ✓ (split word — not penalized)
"hello world" == "helloworld" ✓ (merged words — not penalized)
"நிலையான"    == "நிலை யான"    ✓ (Indic word split — not penalized)
"india"     != "indonesia"   ✗ (different content — still an error)
```

**Why this matters for Indic ASR**: Many Indic scripts have ambiguous word boundaries. Models frequently produce correct character sequences but with incorrect spacing — especially for agglutinative languages (Tamil, Malayalam, Kannada) and compound words in Hindi/Marathi. Standard WER penalizes these as multiple errors (insertion + substitution), inflating the score. MER isolates true recognition errors from tokenization/segmentation errors.

**Computation**:
```python
def space_normalize(text: str, language: str) -> str:
    """Remove all spaces after standard normalization."""
    return norm_standard(text, language).replace(" ", "")

# The space-stripped strings carry no word boundaries, so the error rate
# is computed at character level (CER) and reported on the WER scale.
mer = cer(space_normalize(ref, language), space_normalize(hyp, language))
```

Note: Because the space-stripped text has no word boundaries left, MER is computed as CER on it. We report it on the WER percentage scale for direct comparison with the other WER tiers.

**Interpretation**:
- If `wer_norm` is much worse than `mer`, the model has a **word segmentation problem** — it knows the content but not the boundaries.
- If `wer_norm` ≈ `mer`, spacing is not a significant error source.
- The gap `wer_norm - mer` quantifies exactly how much WER is inflated by spacing errors.

### Normalization Pipeline (applied in order)

```
Step 1: Unicode normalize (NFC for raw, NFKC for norm/numcanon/mer)
Step 2: Strip zero-width joiners/non-joiners (U+200B..U+200F, U+FEFF)
Step 3: Normalize whitespace (collapse runs, trim)
Step 4: Standardize punctuation variants (curly quotes → straight, em-dash → hyphen, double-danda → single)
Step 5: Remove punctuation (norm/numcanon/mer only, using language-aware set)
Step 6: Case fold (English only)
Step 7: Number canonicalization (numcanon only — digit grouping removal, Arabic numeral unification)
Step 8: Space removal (mer only — remove ALL spaces to produce a single character stream)
Step 9: Space-insensitive word alignment (space_norm_wer only — see below)
```
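Steps 1-8 can be sketched in Python (step 9 is the alignment procedure described in the next section). This is a simplified illustration, not the canonical implementation: the punctuation set, the step-4 replacement table, and the digit unification in step 7 are placeholder approximations of the language-aware rules.

```python
import re
import unicodedata

ZERO_WIDTH = re.compile(r"[\u200b-\u200f\ufeff]")
# Placeholder punctuation set; the real pipeline uses language-aware sets.
PUNCT = re.compile(r"[.,!?;:\"'()\[\]{}\u0964-]")

def normalize(text: str, tier: str = "norm", language: str = "hindi") -> str:
    """Apply pipeline steps 1-8 for the given metric tier (sketch)."""
    form = "NFC" if tier == "raw" else "NFKC"
    text = unicodedata.normalize(form, text)                      # step 1
    text = ZERO_WIDTH.sub("", text)                               # step 2
    text = re.sub(r"\s+", " ", text).strip()                      # step 3
    for src, dst in {"\u201c": '"', "\u201d": '"', "\u2018": "'",
                     "\u2019": "'", "\u2014": "-",
                     "\u0965": "\u0964"}.items():
        text = text.replace(src, dst)                             # step 4
    if tier != "raw":
        text = re.sub(r"\s+", " ", PUNCT.sub(" ", text)).strip()  # step 5
        if language == "english":
            text = text.casefold()                                # step 6
    if tier == "numcanon":
        # Step 7 (simplified): unify Devanagari digits to ASCII.
        text = text.translate(str.maketrans("०१२३४५६७८९", "0123456789"))
    if tier == "mer":
        text = text.replace(" ", "")                              # step 8
    return text
```

For example, `normalize("Hello,  World!", "norm", "english")` yields `"hello world"`, and the `mer` tier additionally strips the remaining space.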

### space_norm_wer — Space-Normalized Word Error Rate

A word error rate that forgives whitespace boundaries: characters are aligned on space-stripped text, and the error count is the number of reference words touched by real content edits.

**Algorithm per sample:**
1. Split `ref_norm` into `ref_words`. If empty, return `(0, 0)`.
2. Remove all spaces: `ref_nospace = ref_norm.replace(" ", "")`, same for hyp.
3. If `ref_nospace == hyp_nospace`: return `(0, len(ref_words))` — pure spacing difference = zero errors.
4. Build `char_to_word` mapping: for each character position in `ref_nospace`, record which reference word index it belongs to.
5. Compute Levenshtein edit distance DP over characters of `ref_nospace` vs `hyp_nospace`.
6. Backtrack through the DP matrix. For each edit operation:
   - **Match**: mark nothing.
   - **Substitution**: mark the reference word owning that character.
   - **Deletion**: mark the reference word owning that deleted character.
   - **Insertion**: mark the nearest reference word (previous if available, else next).
7. Count distinct reference words touched by any edit. Each word counts at most once.
8. Return `(error_words, total_words)`.

**Corpus-level aggregation**: Micro-average — sum `error_words` and `total_words` across all samples, then divide.

**Key properties:**
- Forgives pure spacing differences (split/merge words) — `"new york" == "newyork"` scores 0 errors.
- One wrong character makes the entire reference word count as wrong (coarser than MER).
- Stricter than MER, more lenient than wer_norm.
- Hierarchy (in the typical case): `mer ≤ space_norm_wer ≤ wer_norm`

```python
# Example:
# ref: "भद्रादी कोत्तागुडेम और करीमनगर"
# hyp: "भद्रादी कोत्ता गुड़ेम और करीम नगर"
# → ref_nospace vs hyp_nospace differ only by ड→ड़ (one char edit)
# → touches word "कोत्तागुडेम" → error_words=1, total=4 → space_norm_wer=25%
# → wer_norm would be much higher (split words = multiple errors)
# → mer would be lower (just counts the one character)
```
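The eight steps above can be implemented directly. A minimal sketch, assuming `ref_norm` and `hyp_norm` have already been through the standard normalization pipeline; a real implementation should be validated against the canonical scorer.

```python
def space_norm_wer_counts(ref_norm: str, hyp_norm: str) -> tuple[int, int]:
    """Return (error_words, total_words) for one sample."""
    ref_words = ref_norm.split()
    if not ref_words:
        return (0, 0)
    ref_nospace = "".join(ref_words)
    hyp_nospace = "".join(hyp_norm.split())
    if ref_nospace == hyp_nospace:
        return (0, len(ref_words))          # pure spacing difference
    # char_to_word[i] = index of the reference word owning char i
    char_to_word: list[int] = []
    for w_idx, w in enumerate(ref_words):
        char_to_word.extend([w_idx] * len(w))
    m, n = len(ref_nospace), len(hyp_nospace)
    # Character-level Levenshtein DP
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_nospace[i - 1] == hyp_nospace[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Backtrack, marking reference words touched by real edits
    touched: set[int] = set()
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                and ref_nospace[i - 1] == hyp_nospace[j - 1]):
            i, j = i - 1, j - 1                       # match: mark nothing
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            touched.add(char_to_word[i - 1])          # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            touched.add(char_to_word[i - 1])          # deletion
            i -= 1
        else:
            # insertion: nearest ref word (previous if available, else next)
            touched.add(char_to_word[i - 1] if i > 0 else 0)
            j -= 1
    return (len(touched), len(ref_words))
```

`space_norm_wer_counts("new york", "newyork")` returns `(0, 2)` (spacing forgiven), while a single wrong character still marks the whole owning word. Corpus-level aggregation then sums both counts across samples before dividing.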

---

## Schema: `metrics.json`

```jsonc
{
  "<language_name>": {
    "n_samples": 500,              // int
    "wer_raw": 42.10,             // float: % — minimal normalization
    "wer_norm": 34.07,            // float: % — primary metric
    "wer_numcanon": 32.50,        // float: % — with number canonicalization
    "space_norm_wer": 26.20,     // float: % — word-level after space-insensitive char alignment
    "mer": 12.40,                 // float: % — Meaningful Error Rate (CER on space-stripped text)
    "cer_norm": 12.30,            // float: % — character-level normalized
    "empty_hypotheses": 0,        // int: samples where model produced no output
    "normalization_delta": {
      "raw_to_norm": -8.03,       // float: wer_raw - wer_norm
      "norm_to_numcanon": -1.57,  // float: wer_norm - wer_numcanon
      "norm_to_space_norm": -7.87, // float: wer_norm - space_norm_wer (pure spacing inflation)
      "norm_to_mer": -21.67       // float: wer_norm - mer (spacing + word-granularity inflation)
    }
  },
  // ... one key per language (12 total) ...

  "__overall__": {
    "n_samples": 6000,
    "wer_raw": 40.75,
    "wer_norm": 32.07,
    "wer_numcanon": 30.80,
    "space_norm_wer": 24.10,
    "mer": 11.50,
    "cer_norm": 11.34
  },

  "__macro_avg__": {
    "n_languages": 12,
    "wer_raw": 40.52,
    "wer_norm": 31.76,
    "wer_numcanon": 30.20,
    "space_norm_wer": 23.80,
    "mer": 11.20,
    "cer_norm": 11.03
  },

  "__meta__": {
    "checkpoint": "/home/ubuntu/training/checkpoints/gemma3n-e2b-ckpt-20000",
    "checkpoint_name": "ckpt-20000",
    "model_id": "gemma3n-e2b",
    "model_type": "gemma3n-E2B-asr",
    "dataset": "BayAreaBoys/indic-asr-benchmark-6k",
    "batch_size": 128,
    "inference_time_sec": 723.7,
    "total_audio_sec": 40354.46,
    "rtf": 0.0179,
    "timestamp": "2026-03-26T14:30:00Z",
    "gpu": "NVIDIA A100 80GB",
    "framework": "transformers",
    "normalization_version": "v1",
    "jiwer_version": "3.1.0"
  }
}
```

### Required Fields

| Section          | Required                                                                                         |
|------------------|--------------------------------------------------------------------------------------------------|
| Per-language     | `n_samples`, `wer_raw`, `wer_norm`, `wer_numcanon`, `space_norm_wer`, `mer`, `cer_norm`, `empty_hypotheses`, `normalization_delta` |
| `__overall__`    | `n_samples`, `wer_raw`, `wer_norm`, `wer_numcanon`, `space_norm_wer`, `mer`, `cer_norm`         |
| `__macro_avg__`  | `n_languages`, `wer_raw`, `wer_norm`, `wer_numcanon`, `space_norm_wer`, `mer`, `cer_norm`       |
| `__meta__`       | `checkpoint_name`, `model_id`, `dataset`, `inference_time_sec`, `total_audio_sec`, `rtf`, `timestamp`, `normalization_version` |
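A lightweight validator for the requirements above can catch missing fields before files reach the dashboard. A sketch; the field sets mirror the table, and the function name is illustrative, not part of the schema.

```python
REQUIRED_LANG = {"n_samples", "wer_raw", "wer_norm", "wer_numcanon",
                 "space_norm_wer", "mer", "cer_norm", "empty_hypotheses",
                 "normalization_delta"}
REQUIRED_AGG = {"wer_raw", "wer_norm", "wer_numcanon",
                "space_norm_wer", "mer", "cer_norm"}
REQUIRED_META = {"checkpoint_name", "model_id", "dataset",
                 "inference_time_sec", "total_audio_sec", "rtf",
                 "timestamp", "normalization_version"}

def validate_metrics(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    for key, block in doc.items():
        if key.startswith("__"):
            continue  # aggregate/meta sections handled below
        missing = REQUIRED_LANG - block.keys()
        if missing:
            problems.append(f"{key}: missing {sorted(missing)}")
    for section, required in (("__overall__", REQUIRED_AGG),
                              ("__macro_avg__", REQUIRED_AGG),
                              ("__meta__", REQUIRED_META)):
        if section not in doc:
            problems.append(f"missing section {section}")
        else:
            missing = required - doc[section].keys()
            if missing:
                problems.append(f"{section}: missing {sorted(missing)}")
    return problems
```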

---

## Schema: `sample_analysis.json`

Array of per-sample objects. One entry per sample (6000 total).

```jsonc
[
  {
    "id": "as_0000",
    "language": "assamese",
    "reference": "<ground truth>",
    "hypothesis": "<model output>",
    "ref_norm": "<after norm pipeline>",
    "hyp_norm": "<after norm pipeline>",
    "ref_numcanon": "<after numcanon pipeline>",
    "hyp_numcanon": "<after numcanon pipeline>",
    "ref_mer": "<after space removal>",
    "hyp_mer": "<after space removal>",
    "detected_language": "Assamese",
    "wer_raw": 45.0,
    "wer_norm": 30.0,
    "mer": 22.0,
    "flags": ["numeric_mismatch", "punctuation_only_diff", "spacing_error"]
  }
]
```

### Per-sample fields

| Field              | Type     | Required | Description                                              |
|--------------------|----------|----------|----------------------------------------------------------|
| `id`               | string   | Yes      | Unique sample ID                                         |
| `language`         | string   | Yes      | Language name, lowercase                                 |
| `reference`        | string   | Yes      | Ground truth transcription (original)                    |
| `hypothesis`       | string   | Yes      | Model prediction (original)                              |
| `ref_norm`         | string   | Yes      | Reference after norm pipeline                            |
| `hyp_norm`         | string   | Yes      | Hypothesis after norm pipeline                           |
| `ref_numcanon`     | string   | No       | Reference after numcanon pipeline                        |
| `hyp_numcanon`     | string   | No       | Hypothesis after numcanon pipeline                       |
| `ref_mer`          | string   | No       | Reference after space removal (single char stream)       |
| `hyp_mer`          | string   | No       | Hypothesis after space removal (single char stream)      |
| `detected_language`| string   | No       | Language detected by model                               |
| `wer_raw`          | float    | No       | Per-sample WER raw (%)                                   |
| `wer_norm`         | float    | No       | Per-sample WER normalized (%)                            |
| `mer`              | float    | No       | Per-sample MER — Meaningful Error Rate (%)               |
| `flags`            | string[] | No       | Diagnostic tags (see flag vocabulary below)              |

### Flag Vocabulary

Agents SHOULD tag samples with applicable flags:

| Flag                    | Meaning                                                      |
|-------------------------|--------------------------------------------------------------|
| `exact_match`           | reference == hypothesis (raw)                                |
| `exact_match_norm`      | ref_norm == hyp_norm                                         |
| `numeric_mismatch`      | Difference involves digits or number words                   |
| `punctuation_only_diff` | Raw differs but norm matches — punctuation was the only gap  |
| `empty_hypothesis`      | Model produced no output                                     |
| `script_mismatch`       | Hypothesis uses a different script than reference            |
| `lang_confusion`        | detected_language != expected language                        |
| `high_wer`              | wer_norm > 80%                                               |
| `entity_mismatch`       | Named entity (proper noun) was misrecognized                 |
| `spacing_error`         | wer_norm > mer — content is correct but word boundaries wrong|
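Several of these flags can be derived mechanically from fields already present in a `sample_analysis.json` entry. A partial sketch; `script_mismatch`, `entity_mismatch`, and `numeric_mismatch` need language-specific logic and are omitted here.

```python
def derive_flags(sample: dict) -> list[str]:
    """Derive the mechanically-checkable flags for one sample entry."""
    flags = []
    raw_match = sample["reference"] == sample["hypothesis"]
    norm_match = sample["ref_norm"] == sample["hyp_norm"]
    if raw_match:
        flags.append("exact_match")
    if norm_match:
        flags.append("exact_match_norm")
    if norm_match and not raw_match:
        flags.append("punctuation_only_diff")   # only formatting differed
    if not sample["hypothesis"].strip():
        flags.append("empty_hypothesis")
    detected = sample.get("detected_language", "").lower()
    if detected not in ("", sample["language"]):
        flags.append("lang_confusion")
    if sample.get("wer_norm", 0.0) > 80.0:
        flags.append("high_wer")
    if sample.get("wer_norm", 0.0) > sample.get("mer", 0.0):
        flags.append("spacing_error")           # content right, boundaries wrong
    return flags
```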

---

## Schema: `error_analysis.json`

Per-language error breakdown. One file per checkpoint.

```jsonc
{
  "<language_name>": {
    "top_substitutions": [
      {"ref": "word_a", "hyp": "word_b", "count": 42},
      // top 20 by count
    ],
    "top_insertions": [
      {"word": "um", "count": 15},
      // top 20
    ],
    "top_deletions": [
      {"word": "the", "count": 30},
      // top 20
    ],
    "error_buckets": {
      "numeric_mismatch_count": 23,
      "punctuation_only_count": 45,
      "spacing_tokenization_count": 12,
      "entity_mismatch_count": 8,
      "script_confusion_count": 0,
      "empty_hypothesis_count": 0
    },
    "examples": {
      "worst_samples": ["as_0047", "as_0231", "as_0419"],
      "best_samples": ["as_0001", "as_0155", "as_0302"],
      "numeric_mismatch_samples": ["as_0089", "as_0201"],
      "entity_mismatch_samples": ["as_0134"]
    }
  },
  // ... one key per language ...

  "__summary__": {
    "model_diagnosis": "recognition-limited",
    "primary_error_source": "recognition",
    "numeric_verbalization_impact": "moderate",
    "formatting_impact": "low",
    "worst_languages": ["assamese", "malayalam", "kannada"],
    "best_languages": ["hindi", "english", "gujarati"]
  }
}
```

### `__summary__.model_diagnosis` values

| Value                 | Meaning                                           |
|-----------------------|---------------------------------------------------|
| `recognition-limited` | Core ASR errors dominate, normalization helps little |
| `formatting-limited`  | Most errors are punctuation/casing/spacing         |
| `numeric-limited`     | Number verbalization is the primary error source   |
| `mixed`               | No single dominant error category                  |

### Required Fields

| Section        | Required                                                                    |
|----------------|-----------------------------------------------------------------------------|
| Per-language   | `top_substitutions` (>=10), `top_insertions` (>=10), `top_deletions` (>=10), `error_buckets` |
| `__summary__`  | `model_diagnosis`, `primary_error_source`, `worst_languages`, `best_languages` |

---

## Agent Instructions

### Output directory

```
/home/ubuntu/training/benchmark_outputs/<model_id>/<checkpoint_name>/
```

### Files to generate per checkpoint

| File                   | Required | Content                     |
|------------------------|----------|-----------------------------|
| `metrics.json`         | Yes      | All six metric tiers        |
| `sample_analysis.json` | Yes      | Per-sample predictions      |
| `error_analysis.json`  | Yes      | Error breakdown + diagnosis |

### Critical Rules

1. `__meta__.model_id` MUST match the directory `<model_id>` exactly.
2. `__meta__.timestamp` MUST be ISO 8601 UTC of when the run completed.
3. `__meta__.normalization_version` MUST be `"v1"` for this schema version.
4. All six metric tiers (`wer_raw`, `wer_norm`, `wer_numcanon`, `space_norm_wer`, `mer`, `cer_norm`) are mandatory per language and in aggregates.
5. `normalization_delta` must include `raw_to_norm`, `norm_to_numcanon`, `norm_to_space_norm`, and `norm_to_mer`.
6. Use `jiwer` for WER/CER. `__overall__` = micro-average. `__macro_avg__` = mean of per-language values.
7. Save intermediate normalized texts in `sample_analysis.json` (`ref_norm`, `hyp_norm`) so results are auditable.
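
The two aggregation modes in rule 6 diverge when languages have unequal error rates. An illustration with made-up counts:

```python
# Hypothetical per-language error/word counts, for illustration only.
per_lang = {
    "hindi": {"errors": 1200, "ref_words": 5000},   # 24.0% WER
    "tamil": {"errors": 1800, "ref_words": 4500},   # 40.0% WER
}

# __overall__: micro-average — pool counts across languages, then divide.
micro = 100 * sum(v["errors"] for v in per_lang.values()) \
            / sum(v["ref_words"] for v in per_lang.values())

# __macro_avg__: mean of the per-language percentages.
macro = sum(100 * v["errors"] / v["ref_words"] for v in per_lang.values()) \
            / len(per_lang)

print(round(micro, 2), round(macro, 2))  # 31.58 32.0
```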

### Model-Specific Notes

| Model            | model_id        | Checkpoint paths                                              |
|------------------|-----------------|---------------------------------------------------------------|
| Qwen3-ASR        | `qwen3-asr`     | `/home/ubuntu/training/checkpoints/qwen3-asr-ckpt-{N}`       |
| Gemma3n-E2B      | `gemma3n-e2b`   | `/home/ubuntu/training/checkpoints/gemma3n-e2b-ckpt-{N}`     |
| Parakeet-1.1B    | `parakeet-1.1b` | TBD — use `baseline` as checkpoint_name                       |

### Metric Roles

| Role                           | Metric           | Use for                                              |
|--------------------------------|------------------|------------------------------------------------------|
| Primary research metric        | `wer_norm`       | Model comparison, paper reporting, dashboard default  |
| Strict production metric       | `wer_raw`        | Transcript fidelity, formatting quality               |
| Numeric robustness metric      | `wer_numcanon`   | Isolating number verbalization errors                 |
| Space-insensitive word metric  | `space_norm_wer` | Word accuracy ignoring segmentation — spacing forgiven, content errors still word-level |
| Content accuracy metric        | `mer`            | CER on space-stripped text — purest content accuracy  |
| Script-sensitive metric        | `cer_norm`       | Indic script evaluation, character-level quality      |

---

## Dashboard Integration

### nginx endpoint (port 80 on 216.81.248.184)

```
GET /api/benchmarks/                                   → model directory listing (JSON autoindex)
GET /api/benchmarks/<model_id>/                        → checkpoint listing
GET /api/benchmarks/<model_id>/<ckpt>/metrics.json     → metrics file
GET /api/benchmarks/<model_id>/<ckpt>/sample_analysis.json
GET /api/benchmarks/<model_id>/<ckpt>/error_analysis.json
```

### Next.js API routes (on Vercel)

```
GET /api/models                                → aggregated model + checkpoint list
GET /api/metrics?model=X&checkpoint=Y          → metrics.json
GET /api/samples?model=X&checkpoint=Y&language=Z&page=N&pageSize=N
GET /api/errors?model=X&checkpoint=Y           → error_analysis.json
GET /api/compare                               → cross-model WER comparison
```

Agents place files in `benchmark_outputs/`; the dashboard picks them up on the next request. No restart or redeploy is needed.

---

## Migration from Old Schema

Old schema fields map to new as follows:

| Old Field          | New Field        |
|--------------------|------------------|
| `wer`              | `wer_raw`        |
| `cer`              | (dropped — use `cer_norm`) |
| `wer_normalized`   | `wer_norm`       |
| `cer_normalized`   | `cer_norm`       |
| (did not exist)    | `wer_numcanon`   |
| (did not exist)    | `space_norm_wer` |
| (did not exist)    | `mer`            |
| (did not exist)    | `normalization_delta` |
| (did not exist)    | `error_analysis.json` |

Agents must regenerate results using the new schema. Old `benchmark_results/` and `benchmark_results_gemma3n/` directories are retained for reference but are NOT consumed by the dashboard.
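
The field mapping above can be applied mechanically when inspecting old results; the new-only metrics cannot be back-filled and must come from a rerun. A sketch, with an illustrative function name:

```python
FIELD_MAP = {"wer": "wer_raw", "wer_normalized": "wer_norm",
             "cer_normalized": "cer_norm"}
DROPPED = {"cer"}  # dropped in the new schema — use cer_norm instead

def migrate_language_block(old: dict) -> dict:
    """Rename old-schema metric keys and drop retired ones.
    wer_numcanon, space_norm_wer, and mer require a rerun."""
    return {FIELD_MAP.get(k, k): v for k, v in old.items()
            if k not in DROPPED}
```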
