---
pretty_name: Indic ASR Benchmark 6K
task_categories:
- automatic-speech-recognition
language:
- as
- bn
- en
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
size_categories:
- 1K<n<10K
---

# Indic ASR Benchmark 6K

A multilingual automatic speech recognition (ASR) evaluation benchmark covering 12 languages: 11 Indic languages and Indic-accented English.

## Overview

- Total samples: 6000
- Total duration: 11.21 hours
- Languages: 12
- Samples per language: 500
- Random seed: 42

## Sources

- `Kathbath (test_known)`: AI4Bharat Kathbath benchmark, `test_known` split; used for 10 Indic languages
- `Svarah`: AI4Bharat Svarah; used for Indic-accented English
- `IndicVoices (valid)`: AI4Bharat IndicVoices validation set; used for Assamese

## Per-language breakdown

- assamese (`as`): 500 samples, 0.92 hr, source: IndicVoices (valid)
- bengali (`bn`): 500 samples, 0.89 hr, source: Kathbath (test_known)
- english (`en`): 500 samples, 0.73 hr, source: Svarah
- gujarati (`gu`): 500 samples, 0.83 hr, source: Kathbath (test_known)
- hindi (`hi`): 500 samples, 0.80 hr, source: Kathbath (test_known)
- kannada (`kn`): 500 samples, 1.12 hr, source: Kathbath (test_known)
- malayalam (`ml`): 500 samples, 1.42 hr, source: Kathbath (test_known)
- marathi (`mr`): 500 samples, 0.92 hr, source: Kathbath (test_known)
- odia (`or`): 500 samples, 0.84 hr, source: Kathbath (test_known)
- punjabi (`pa`): 500 samples, 0.77 hr, source: Kathbath (test_known)
- tamil (`ta`): 500 samples, 0.93 hr, source: Kathbath (test_known)
- telugu (`te`): 500 samples, 1.04 hr, source: Kathbath (test_known)
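
The breakdown above can be re-derived from the `lang_code` and `duration` columns. The following is a minimal sketch, not the script used to build this card; the helper name `summarize_by_language` and the synthetic rows are illustrative assumptions.

```python
from collections import defaultdict


def summarize_by_language(rows):
    """Return {lang_code: (sample_count, total_hours)} for an iterable of row dicts."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for row in rows:
        code = row["lang_code"]
        counts[code] += 1
        seconds[code] += row["duration"]  # `duration` is stored in seconds
    return {code: (counts[code], seconds[code] / 3600.0) for code in sorted(counts)}


# Tiny synthetic rows for illustration only (not real benchmark data).
example_rows = [
    {"lang_code": "hi", "duration": 1800.0},
    {"lang_code": "hi", "duration": 1800.0},
    {"lang_code": "ta", "duration": 900.0},
]

for code, (n, hours) in summarize_by_language(example_rows).items():
    print(f"{code}: {n} samples, {hours:.2f} hr")
```

With the real dataset loaded as `ds`, passing `ds.remove_columns(["audio"])` to the helper should avoid decoding audio while iterating. Each language is expected to report 500 samples.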

## Columns

- `id`: sample id derived from filename
- `filename`: original filename
- `reference`: reference transcript
- `language`: language name
- `primary_language`: normalized primary language field
- `lang_code`: short language code
- `source`: dataset source for the sample
- `duration`: duration in seconds
- `speaker_id`: speaker id when available
- `scenario`: scenario / prompt type when available
- `m4a_path`: original local m4a path when available
- `audio`: embedded audio column
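
Because `speaker_id`, `scenario`, and `m4a_path` are only populated "when available", it can help to count how often they are missing before relying on them. The sketch below assumes that a missing value appears as `None` or an empty string; this card does not state how missing values are encoded, so adjust the check if needed. The helper name `count_missing` is hypothetical.

```python
OPTIONAL_COLUMNS = ("speaker_id", "scenario", "m4a_path")


def count_missing(rows, columns=OPTIONAL_COLUMNS):
    """Count rows where each optional column is absent, None, or an empty string."""
    missing = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            value = row.get(name)
            # Assumption: missing values are encoded as None or "".
            if value is None or value == "":
                missing[name] += 1
    return missing


# Synthetic rows for illustration only (not real benchmark data).
example_rows = [
    {"speaker_id": "spk_001", "scenario": "read", "m4a_path": None},
    {"speaker_id": "", "scenario": None, "m4a_path": None},
]

print(count_missing(example_rows))
```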

## Usage

```python
from datasets import load_dataset

ds = load_dataset("BayAreaBoys/indic-asr-benchmark-6k", split="train")
print(ds[0])
```
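
This card does not prescribe a metric, but ASR output is commonly scored against the `reference` column with word error rate (WER). The following is a minimal pure-Python sketch under stated assumptions: whitespace tokenization, no text normalization, and corpus-level aggregation. The function names are illustrative. Whitespace tokenization and the lack of normalization may not suit every script or language in this benchmark.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token lists (substitution, insertion, deletion)."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        total_errors += edit_distance(ref_tokens, hyp.split())
        total_words += len(ref_tokens)
    return total_errors / total_words if total_words else 0.0


# Synthetic strings for illustration only.
print(corpus_wer(["a b c d"], ["a x c"]))
```

In practice, `references` would come from `ds["reference"]` and `hypotheses` from the model under evaluation, in the same order. Reporting WER separately per `lang_code` keeps the longer languages from dominating a single aggregate.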

## Note

This dataset was assembled from local benchmark data for testing and evaluation. Verify upstream source licenses and redistribution terms before downstream reuse.
