# Maya ASR: FastConformer + Gemma Transcription Build Guide

## Executive Summary

Build a speech-to-text transcription system using a FastConformer encoder coupled with a small Gemma 3 decoder, trained at full-parameter scale on ~150K hours of transcribed Indic+English data across 12 languages. **Task: transcription only** (no translation). Covers architecture design, tokenizer construction, multi-stage training recipes, distributed training on 8×H100 GPUs, and concrete NeMo scripts and config references.

---

## 1. Architecture Overview

### 1.1 High-Level Design

```
┌──────────────────────────────────────────────────────────────────┐
│                    Maya ASR (~800M–1.2B)                         │
│                                                                  │
│  ┌───────────────────────┐    ┌──────────┐    ┌──────────────┐   │
│  │ FastConformer Encoder │───►│Projection│───►│ Gemma 3      │   │
│  │  (IndicConformer-600M)│    │  Layer   │    │ (small)      │   │
│  │  32 layers, 1024 dim  │    │ Linear   │    │ Decoder      │   │
│  │  8x subsampling       │    │ MLP      │    │              │   │
│  └───────────────────────┘    └──────────┘    └──────────────┘   │
│                                                                  │
│  Language Prompting: <|source_lang|> token guides decoder        │
│  Task: <|transcribe|> only                                       │
└──────────────────────────────────────────────────────────────────┘
```

### 1.2 Component Breakdown

**Encoder: FastConformer (IndicConformer-600M base)**

Based on [AI4Bharat's IndicConformer-600M multilingual model](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual), a Conformer-based hybrid CTC+RNNT model supporting all 22 scheduled Indian languages.

- Architecture: FastConformer with 8x depthwise convolutional subsampling (256 channels)
- Reduced convolutional kernel size of 9 in conformer blocks
- 2.4x faster training/inference than standard Conformer without quality degradation
- Supports local attention for audio >1 hour
- NeMo-native: `EncDecHybridRNNTCTCBPEModel` class

Scale to **32 encoder layers at 1024 hidden dimension** (matching Canary-1B-Flash's encoder spec). Initialize from IndicConformer-600M checkpoint and expand layers if needed.

Reference architecture table:

| Model | Params | encoder.n_layers | enc_hidden |
|---|---|---|---|
| Canary-1B-Flash | 883M | 32 | 1024 |
| Canary-180M-Flash | 182M | 17 | 512 |
| IndicConformer-600M | 600M | — | — |

**Recommended encoder config**: 32 layers, 1024 hidden (Canary-1B-Flash pattern), initialized from IndicConformer-600M weights.

**Decoder: Gemma 3 (Small Variant)**

Use the smallest viable Gemma 3 variant as a decoder-only LLM. Candidates:

| Variant | Params | Hidden Dim | Layers | Notes |
|---|---|---|---|---|
| `google/gemma-3-1b-pt` | 1B | 1152 | 26 | Smallest official Gemma 3 |
| `google/gemma-2-2b` | 2.6B | 2304 | 26 | Gemma 2, good Indic but larger |

**Primary choice: Gemma 3 1B** — it is already the smallest Gemma 3 release. Its 262K SentencePiece tokenizer has excellent Indic coverage:

| Language | Gemma 3 Fertility | Dedicated Tokens |
|---|---|---|
| Hindi | 1.38 | 4,391 Devanagari tokens |
| Bengali | 1.74 | 541 tokens |
| Tamil | 2.42 | 724 tokens |
| Telugu | 2.93 | 404 tokens |
| Kannada | 3.15 | 231 tokens |
| Malayalam | 3.39 | ~200 tokens (est.) |

With 150K hours, full instruction tuning (not LoRA) is the right approach — the decoder needs to deeply learn acoustic-to-text alignment for all 12 languages.

If Gemma 3 1B proves too heavy, consider distilling down or using only a subset of its layers (e.g., first 16 of 26 layers) as a smaller decoder, keeping the tokenizer.

**Projection Layer: Linear MLP**

Linear projection outperforms Q-Former for speech-LLM coupling ([MLC-SLM 2025](https://arxiv.org/html/2601.01461v1): Linear 11.05% WER vs Q-Former 11.51%).

Design: `Conv1D(stride=2) → LayerNorm → Linear(1024 → 1152) → SiLU → Linear(1152 → 1152)`

This downsamples encoder output by 2x temporally (on top of the 8x from FastConformer), then maps from the encoder dimension (1024) to Gemma 3 1B's hidden size (1152).
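A minimal PyTorch sketch of this projection (the `SpeechProjection` class name is ours; the HF `google/gemma-3-1b-pt` config reports `hidden_size=1152`, so set `llm_dim` to whatever your decoder actually uses):

```python
import torch
import torch.nn as nn

class SpeechProjection(nn.Module):
    """Conv1D(stride=2) → LayerNorm → Linear → SiLU → Linear."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 1152):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, enc_dim, kernel_size=2, stride=2)  # 2x temporal downsampling
        self.norm = nn.LayerNorm(enc_dim)
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, enc_dim); Conv1d wants channels second, so transpose around it
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.mlp(self.norm(x))
```

`SpeechProjection()(torch.randn(2, 100, 1024))` yields a `(2, 50, llm_dim)` tensor, ready to concatenate with the decoder's text embeddings.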

---

## 2. Local Data

### 2.1 Data Location

All data is pre-downloaded at `/root/sft_data/` (~13 TiB).

### 2.2 Data Sources

| Source | Path | Languages | Description |
|---|---|---|---|
| **final-export** | `/root/sft_data/final-export/production/shards/` | 12 (as, bn, en, gu, hi, kn, ml, mr, or, pa, ta, te) | Primary dataset, largest |
| **indicvoices** | `/root/sft_data/indicvoices/` | 11 Indic | AI4Bharat public benchmark |
| **indicvoices-r** | `/root/sft_data/indicvoices-r/` | 11 Indic | High-quality Indic speech |
| **josh** | `/root/sft_data/josh/` | 5 (bn, en, gu, hi, mr) | Indic content |
| **joshdelivery** | `/root/sft_data/joshdelivery/` | 5 | Additional Indic data |
| **globe** | `/root/sft_data/globe/` | Multi | Multilingual data |
| **ears** | `/root/sft_data/ears/` | en | English |
| **expresso** | `/root/sft_data/expresso/` | en | English |
| **librittsr** | `/root/sft_data/librittsr/` | en | English TTS-quality |
| **ljspeech** | `/root/sft_data/ljspeech/` | en | English speech |
| **vctk** | `/root/sft_data/vctk/` | en | Multi-speaker English |

### 2.3 Shard Structure (final-export)

Organized as `lang={code}/{shard_id}/` with each shard containing:
- `audio.tar` — compressed FLAC audio files (~15,000 segments per shard)
- `audio_index.parquet` — audio indexing metadata
- `manifest.json` — shard-level metadata (counts, checksums, language)
- `metadata.parquet` — per-segment metadata (transcription, duration, speaker, domain)
- `xcodec2_tokens.parquet` — pre-computed acoustic tokens (optional)

**Shard counts per language (final-export):**

| Language | Shards | Est. Hours |
|---|---|---|
| English (en) | 840 | ~28K |
| Hindi (hi) | 475 | ~16K |
| Telugu (te) | 400 | ~13K |
| Malayalam (ml) | 372 | ~12K |
| Punjabi (pa) | 343 | ~11K |
| Tamil (ta) | 312 | ~10K |
| Kannada (kn) | 200 | ~7K |
| Gujarati (gu) | 186 | ~6K |
| Bengali (bn) | 158 | ~5K |
| Marathi (mr) | 148 | ~5K |
| Odia (or) | 78 | ~3K |
| Assamese (as) | 39 | ~1K |
| **Total** | **3,551** | **~117K** |

### 2.4 Data Format Conversion

The local data needs to be converted to NeMo manifest format for training:

```json
{"audio_filepath": "/data/hindi/audio_001.wav", "text": "नमस्ते दुनिया", "duration": 5.2, "lang": "hi", "taskname": "asr", "source_lang": "hi", "target_lang": "hi"}
```

**Step 1: Extract per-segment manifests from metadata parquet files**

Write a script to iterate all shards, read `metadata.parquet`, and produce NeMo-compatible JSON-line manifests pointing into the `audio.tar` files (or extracted audio).
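A sketch of that script. The column names (`segment_id`, `transcription`, `duration`, `lang`) are assumptions based on the shard description in §2.3 — verify them against the real parquet schema before use:

```python
import json
from pathlib import Path

import pandas as pd  # pyarrow needed for parquet reads

def metadata_to_manifest_rows(meta: pd.DataFrame, audio_root: Path) -> list[dict]:
    """Turn one shard's metadata table into NeMo manifest rows."""
    rows = []
    for rec in meta.itertuples():
        rows.append({
            "audio_filepath": str(audio_root / f"{rec.segment_id}.flac"),
            "text": rec.transcription,
            "duration": float(rec.duration),
            "lang": rec.lang,
            "taskname": "asr",
            "source_lang": rec.lang,
            "target_lang": rec.lang,
        })
    return rows

def shard_to_manifest(shard_dir: Path, audio_root: Path, out_path: Path) -> None:
    # NeMo manifests are JSON Lines: one UTF-8 JSON object per line
    rows = metadata_to_manifest_rows(
        pd.read_parquet(shard_dir / "metadata.parquet"), audio_root
    )
    with out_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```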

**Step 2: Convert to tarred WebDataset format** (recommended for 150K hours):

```bash
python scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
  --manifest_path=train_manifest.json \
  --target_dir=/data/tarred/hindi/ \
  --num_shards=512 \
  --max_duration=40.0 \
  --min_duration=0.01
```

Since audio is already tarred per-shard, we may be able to adapt existing tars directly or write a custom `LhotseDataset` that reads from the existing shard format.

### 2.5 Multi-Corpus Input Configuration

Use NeMo's `input_cfg` YAML format:

```yaml
input_cfg:
  - corpus: final_export_hindi
    language: hi
    manifest_filepath: /data/nemo_manifests/hi/manifest.json
    tarred_audio_filepaths: /data/nemo_tarred/hi/audio_*.tar
    type: nemo_tarred
    weight: 10000
  - corpus: final_export_bengali
    language: bn
    manifest_filepath: /data/nemo_manifests/bn/manifest.json
    tarred_audio_filepaths: /data/nemo_tarred/bn/audio_*.tar
    type: nemo_tarred
    weight: 10000
  # ... repeat for all 12 languages + supplementary sources
```

### 2.6 Language Balancing

Use **two-stage deterministic upsampling** ([Polyglot-Lion, March 2026](https://arxiv.org/html/2603.16184v1)):

1. **Intra-language**: Upsample smaller corpora within each language to match the largest corpus
2. **Inter-language**: Upsample each language to match the largest language group

In Polyglot-Lion's experiments this yielded a 72% relative WER reduction for underrepresented languages without degrading high-resource ones.

Alternative: Temperature-based sampling (α = 0.3–0.5) as used by [Canary-1B-v2](https://arxiv.org/html/2509.14128v2).
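The two-stage scheme reduces to a small weight computation. A sketch (the hours-table layout and function name are ours):

```python
def two_stage_weights(hours: dict[str, dict[str, float]]) -> dict[tuple[str, str], float]:
    """Deterministic upsampling factor per (language, corpus).

    `hours[lang][corpus]` holds raw corpus hours. Stage 1 raises every corpus
    to its language's largest corpus; stage 2 raises every language to the
    largest language group.
    """
    weights = {}
    effective = {}  # per-language hours after stage 1
    for lang, corpora in hours.items():
        biggest = max(corpora.values())
        effective[lang] = biggest * len(corpora)
        for corpus, h in corpora.items():
            weights[(lang, corpus)] = biggest / h             # intra-language
    top = max(effective.values())
    for lang, corpora in hours.items():
        for corpus in corpora:
            weights[(lang, corpus)] *= top / effective[lang]  # inter-language
    return weights
```

The resulting factors can feed directly into the `weight` fields of the `input_cfg` YAML above.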

### 2.7 Data Augmentation

At 150K hours, heavy augmentation is less critical but still beneficial:

- **Speed perturbation** (0.9x, 1.0x, 1.1x): Standard 3x multiplier
- **SpecAugment**: Frequency masking (F=27, num_masks=2) + time masking (T=0.05×duration, num_masks=10)
- **Noise augmentation**: Environmental noise at SNRs 5–20 dB (MUSAN or Indic-environment noise)

---

## 3. Tokenizer Design

### 3.1 Decoder Tokenizer: Gemma 3's Tokenizer

Use Gemma 3's existing 262K SentencePiece tokenizer directly. It already has strong Indic coverage with 4K+ dedicated Indic tokens. No custom tokenizer training needed for the decoder side.

Add special tokens:
- `<|transcribe|>` — task token
- Language ID tokens: `<|hi|>`, `<|bn|>`, `<|ta|>`, `<|te|>`, `<|gu|>`, `<|kn|>`, `<|ml|>`, `<|mr|>`, `<|pa|>`, `<|or|>`, `<|as|>`, `<|en|>`
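Registering these tokens can be sketched with the standard Hugging Face `add_special_tokens` / `resize_token_embeddings` APIs. The new embedding rows start randomly initialized and are learned during Stage 2:

```python
LANGS = ["hi", "bn", "ta", "te", "gu", "kn", "ml", "mr", "pa", "or", "as", "en"]
SPECIAL_TOKENS = ["<|transcribe|>"] + [f"<|{lang}|>" for lang in LANGS]

def extend_tokenizer(tokenizer, model) -> int:
    """Register task/language tokens and grow the embedding table to match."""
    n_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": SPECIAL_TOKENS}
    )
    if n_added:
        # Embedding (and tied output) matrix must cover the new vocab size
        model.resize_token_embeddings(len(tokenizer))
    return n_added
```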

### 3.2 Encoder-Side Tokenizer (for CTC Auxiliary Loss)

Train a smaller custom Indic BPE tokenizer (4K–8K vocab) for the CTC auxiliary loss on the encoder:

```bash
python scripts/tokenizers/process_asr_text_tokenizer.py \
  --manifest=balanced_text_corpus.json \
  --vocab_size=8192 \
  --tokenizer_type=bpe \
  --spe_type=unigram \
  --spe_character_coverage=0.9999 \
  --spe_split_by_unicode_script=true \
  --output_dir=/tokenizers/encoder_8k/
```

---

## 4. Training Recipe

### 4.1 Stage 0: Encoder Pretraining / Warm-Start

Start from [IndicConformer-600M multilingual](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual):

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    "ai4bharat/indic-conformer-600m-multilingual"
)
```

If scaling encoder to 32 layers, initialize new layers from existing weights via interpolation or duplication, then continue pretraining with CTC+RNNT loss for 50K–100K steps.
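One simple duplication scheme, sketched over a raw state dict. The `encoder.layers.<i>.` key layout is an assumption — check the actual NeMo checkpoint naming before applying:

```python
import re

def expand_layers(state_dict: dict, old_n: int, new_n: int,
                  prefix: str = "encoder.layers.") -> dict:
    """Grow a stacked-layer state dict from old_n to new_n layers by cyclic
    duplication: new layer i copies layer i % old_n."""
    out = dict(state_dict)
    pat = re.compile(rf"^{re.escape(prefix)}(\d+)\.(.+)$")
    by_layer = {}
    for key, value in state_dict.items():
        m = pat.match(key)
        if m:
            by_layer.setdefault(int(m.group(1)), {})[m.group(2)] = value
    for i in range(old_n, new_n):
        for suffix, value in by_layer[i % old_n].items():
            out[f"{prefix}{i}.{suffix}"] = value  # use value.clone() for torch tensors
    return out
```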

Alternative: Initialize from [Canary-1B-Flash encoder](https://huggingface.co/nvidia/canary-1b-flash) (32 layers, 1024 dim):

```yaml
init_from_pretrained_model:
  model0:
    name: "nvidia/canary-1b-flash"
    include: ["encoder"]
    exclude: ["encoder.pre_encode.out"]
```

### 4.2 Stage 1: Encoder Fine-Tuning on Indic Data (CTC+RNNT)

Fine-tune the encoder on all ~150K hours using hybrid CTC+RNNT loss. This creates a strong Indic speech feature extractor before coupling with the LLM decoder.

```bash
python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe_prompt.py \
  --config-path=examples/asr/conf/fastconformer/hybrid_transducer_ctc/ \
  --config-name=fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml \
  model.train_ds.manifest_filepath=/data/train_manifest.json \
  model.validation_ds.manifest_filepath=/data/val_manifest.json \
  model.tokenizer.dir=/tokenizers/encoder_8k/ \
  model.train_ds.use_lhotse=true \
  model.train_ds.input_cfg=/configs/input_cfg.yaml \
  trainer.devices=8 \
  trainer.num_nodes=1 \
  trainer.max_steps=200000 \
  trainer.val_check_interval=5000 \
  model.optim.name=adamw \
  model.optim.lr=0.001 \
  model.optim.betas=[0.9,0.98] \
  model.optim.weight_decay=0.0001 \
  model.optim.sched.name=CosineAnnealing \
  model.optim.sched.warmup_steps=5000 \
  trainer.precision=bf16-mixed
```

**Duration**: ~5–10 days on 8×H100 for 200K steps.

Key insight from MLC-SLM research: "fine-tuning speech encoders on in-domain data before integrating them into the LLM is more effective than the tri-stage pipeline" ([arXiv](https://arxiv.org/html/2601.01461v1)).

### 4.3 Stage 2: Attach Gemma Decoder + Joint Training

Two-phase approach:

**Stage 2a: Train projection only (encoder + LLM frozen), ~20K steps**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pretrained components (load_indicconformer_encoder is a placeholder
# for loading the Stage 1 encoder checkpoint)
encoder = load_indicconformer_encoder("stage1_checkpoint.nemo")
llm = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-pt")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-pt")

# Stage 2a: freeze encoder and LLM; only the projection trains
encoder.requires_grad_(False)
llm.requires_grad_(False)

# Define projection layer (Conv1d expects (B, C, T), so transpose around it)
class Projection(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=llm.config.hidden_size):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, enc_dim, kernel_size=2, stride=2)  # 2x temporal downsampling
        self.norm = nn.LayerNorm(enc_dim)
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),  # enc_dim → Gemma hidden_dim
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):                                 # (B, T, enc_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (B, T/2, enc_dim)
        return self.mlp(self.norm(x))

projection = Projection()

# Stage 2a training loop (schematic)
for batch in dataloader:
    with torch.no_grad():
        audio_features = encoder(batch.audio)            # frozen
    projected = projection(audio_features)               # trainable

    # Prepend language + task tokens: <|hi|><|transcribe|>, then the gold
    # transcription embeddings for teacher forcing
    embed = llm.get_input_embeddings()
    inputs_embeds = torch.cat(
        [projected, embed(batch.prompt_ids), embed(batch.target_ids)], dim=1
    )

    logits = llm(inputs_embeds=inputs_embeds).logits     # frozen
    # Cross-entropy on the transcription positions only (next-token shift)
    n_text = batch.target_ids.size(1)
    loss = F.cross_entropy(
        logits[:, -n_text - 1:-1].flatten(0, 1), batch.target_ids.flatten()
    )
    loss.backward()
```

**Stage 2b: Joint training (projection + full LLM, encoder frozen or LoRA), ~100K–200K steps**

- LR: 1e-5 for LLM, 1e-4 for projection
- Use DeepSpeed ZeRO-2 or FSDP for memory efficiency
- Cross-entropy loss on transcription tokens only
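The differential learning rates map directly onto optimizer parameter groups. A sketch (the `betas` and `weight_decay` values are illustrative, not from the recipe above):

```python
import torch

def stage2b_optimizer(projection: torch.nn.Module, llm: torch.nn.Module):
    """AdamW with per-module learning rates: 1e-4 projection, 1e-5 LLM."""
    return torch.optim.AdamW(
        [
            {"params": projection.parameters(), "lr": 1e-4},
            {"params": llm.parameters(), "lr": 1e-5},
        ],
        betas=(0.9, 0.98),   # illustrative; match your Stage 2 config
        weight_decay=0.01,
    )
```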

**SKIP-SALSA reference** ([Interspeech 2025](https://www.isca-archive.org/interspeech_2025/mittal25_interspeech.pdf)):

Designed for scenarios where the LLM has much better tokenization than the ASR decoder (exactly our case). Showed up to 20% WER improvement on IndicVoices data. At 150K hour scale, full parameter training is preferred over SKIP-SALSA's frozen-encoder approach, but the projection design is valuable inspiration.

Code reference: [github.com/csalt-research/salsa](https://github.com/csalt-research/salsa)

### 4.4 Stage 3: Quality Fine-Tuning

Fine-tune on a **language-balanced, high-quality subset**:

- Select cleanest 10% of data per language
- Train for 10K–20K additional steps
- Lower learning rate (0.1× of Stage 2 peak)
- This stage polishes accuracy and reduces hallucination
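Subset selection is a per-language top-k over a quality score. A sketch, assuming each manifest row carries a hypothetical `quality` field (e.g. an offline CTC-confidence or alignment score you compute beforehand):

```python
def select_quality_subset(rows: list[dict], frac: float = 0.10) -> list[dict]:
    """Keep the top `frac` of manifest rows per language by quality score."""
    by_lang: dict[str, list[dict]] = {}
    for row in rows:
        by_lang.setdefault(row["lang"], []).append(row)
    keep = []
    for lang, group in by_lang.items():
        group.sort(key=lambda r: r["quality"], reverse=True)
        k = max(1, int(len(group) * frac))  # keep at least one per language
        keep.extend(group[:k])
    return keep
```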

---

## 5. Distributed Training Configuration

### 5.1 Hardware: 8×H100 80GB

For the assembled model (600M+ encoder, projection layer, and a 1B Gemma decoder — closer to 1.6B parameters total unless the decoder is pruned as discussed in §1.2):
- **Single H100**: 80GB VRAM, sufficient for inference but tight for full training
- **8×H100**: DDP as default for encoder-only stages. DeepSpeed ZeRO-2 for joint training with Gemma

### 5.2 NeMo Trainer Config (Encoder Stages)

```yaml
trainer:
  devices: 8
  num_nodes: 1
  accelerator: gpu
  strategy: ddp
  precision: bf16-mixed
  max_steps: 200000
  val_check_interval: 5000
  accumulate_grad_batches: 4
  gradient_clip_val: 1.0
  log_every_n_steps: 100
```

### 5.3 DeepSpeed ZeRO-2 (Joint Training with Gemma)

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "none"},
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true
  },
  "bf16": {"enabled": true},
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": 8,
  "wall_clock_breakdown": false
}
```

ZeRO-2 shards optimizer + gradients (~3x memory reduction). ZeRO-3 only needed if >3B params.

### 5.4 Training Time Estimation

| Stage | Steps | Time/Step (8×H100) | Total |
|---|---|---|---|
| Stage 1: Encoder CTC+RNNT | 200K | ~2–3s | ~5–7 days |
| Stage 2a: Projection warmup | 20K | ~3s | ~1 day |
| Stage 2b: Full joint training | 150K | ~4–6s | ~7–10 days |
| Stage 3: Quality fine-tune | 20K | ~4s | ~1 day |
| **Total** | | | **~14–19 days** |
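These totals are straightforward steps × seconds-per-step arithmetic. A quick sanity check using midpoint estimates from the table's stated ranges:

```python
def stage_days(steps: int, secs_per_step: float) -> float:
    """Wall-clock days, ignoring validation and checkpointing overhead."""
    return steps * secs_per_step / 86400

total = (stage_days(200_000, 2.5)    # Stage 1
         + stage_days(20_000, 3.0)   # Stage 2a
         + stage_days(150_000, 5.0)  # Stage 2b
         + stage_days(20_000, 4.0))  # Stage 3
# total lands around 16 days, inside the table's ~14-19 day range
```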

### 5.5 Training Optimizations

- **bf16 mixed precision**: Standard on H100 — roughly 2× activation-memory reduction plus BF16 tensor-core acceleration
- **Activation checkpointing**: Essential for 32-layer encoder — rematerialize activations during backward pass
- **Lhotse dynamic bucketing**: Groups samples by duration, minimizing padding waste:
  ```yaml
  train_ds:
    use_lhotse: true
    use_bucketing: true
    num_buckets: 30
    bucket_buffer_size: 20000
    shuffle_buffer_size: 10000
    max_duration: 40.0
    min_duration: 0.01
  ```
  Use NeMo's [2D duration bin estimation script](https://github.com/NVIDIA-NeMo/NeMo/blob/main/scripts/speech_recognition/estimate_duration_bins_2d.py)
- **Tarred audio datasets**: Pre-tar into WebDataset shards (512–1024 per language). Our data is already shard-tarred.
- **Gradient accumulation**: accumulate_grad_batches=4–8 for large effective batch sizes
- **Cosine annealing scheduler**: Standard for FastConformer training

---

## 6. Evaluation

### 6.1 Benchmarks

| Benchmark | Languages | Domain | Source |
|---|---|---|---|
| [Vistaar](https://github.com/AI4Bharat/vistaar) | 12 Indic | 59 benchmarks: news, education, tourism | AI4Bharat |
| [IndicSUPERB](https://ai4bharat.iitm.ac.in/) | 22 Indic | Standardized clean + noisy splits | AI4Bharat |
| [FLEURS](https://huggingface.co/datasets/google/fleurs) | Indic subset | Read speech | Google |
| [CommonVoice](https://commonvoice.mozilla.org/) | Multiple Indic | Crowd-sourced read speech | Mozilla |
| [Kathbath](https://github.com/AI4Bharat/vistaar) | 12 Indic | Conversational | AI4Bharat |

### 6.2 Metrics

- **WER** (Word Error Rate): Primary metric for all languages
- **CER** (Character Error Rate): Secondary for morphologically rich languages (Tamil, Malayalam, Telugu)
- **RTFx** (Real-Time Factor, inverse): Throughput measurement
- Per-language breakdown — never average across languages without reporting individual scores
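WER and CER are both Levenshtein edit distances, computed over words and characters respectively. A dependency-free sketch for spot checks (use a tested library for reported numbers):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance with a rolling row of the DP table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    words = ref.split()
    return edit_distance(words, hyp.split()) / max(1, len(words))

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(1, len(ref))
```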

### 6.3 Baselines to Compare Against

| Model | Best Hindi WER | Notes |
|---|---|---|
| [Google STT](https://arxiv.org/html/2602.03868v1) | 16.2% (agricultural) | Commercial |
| [IndicWhisper](https://github.com/AI4Bharat/vistaar) | 13.6% avg (Vistaar) | Fine-tuned Whisper, 12 Indic |
| [IndicConformer-600M](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual) | Varies by lang | 22 Indic, NeMo-native |
| [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) | SOTA open-source | Hindi only among Indic |

---

## 7. Avoiding Catastrophic Forgetting

With 12 languages trained jointly, forgetting is less of a concern. Key strategies:

- **Joint training** (all languages simultaneously): Best approach. No forgetting by design.
- **Language-balanced sampling**: Critical — without it, high-resource languages dominate
- **If adding languages later**: Use Learning without Forgetting (LwF) — consistently low, stable WER and the best backward transfer (BWT) in noisy settings
- **CTC auxiliary loss**: Acts as implicit regularizer, stabilizing encoder representations across languages

Code: [github.com/FrozenWolf-Cyber/Indic-CL-ASR](https://github.com/FrozenWolf-Cyber/Indic-CL-ASR)
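For the LwF option, the loss combines cross-entropy on the new language with a distillation term against the frozen previous model. A sketch (the `alpha` and `temperature` values are illustrative, not taken from the cited work):

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, old_logits, targets, alpha=0.5, temperature=2.0):
    """Learning-without-Forgetting objective: cross-entropy on the new
    language plus a temperature-softened KL term that keeps the updated
    model close to the frozen previous model's output distribution."""
    ce = F.cross_entropy(new_logits, targets)
    kd = F.kl_div(
        F.log_softmax(new_logits / temperature, dim=-1),
        F.softmax(old_logits / temperature, dim=-1),  # frozen-model targets
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd
```

When the new model matches the old one exactly, the KL term vanishes and only the (scaled) cross-entropy remains.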

---

## 8. Key References and Code

### 8.1 Training Scripts

| Script | Path | Purpose |
|---|---|---|
| Hybrid CTC+RNNT Training | `examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe_prompt.py` | Encoder training with prompt support |
| AED Multitask Training | `examples/asr/speech_multitask/speech_to_text_aed.py` | Canary-style training (reference) |
| CTC Language Fine-tuning | `tutorials/asr/ASR_CTC_Language_Finetuning.ipynb` | Tutorial for adding new languages |
| Tokenizer Training | `scripts/tokenizers/process_asr_text_tokenizer.py` | Build SentencePiece tokenizer |
| Duration Bin Estimation | `scripts/speech_recognition/estimate_duration_bins_2d.py` | Optimize Lhotse bucketing |
| SALSA/SKIP-SALSA | `github.com/csalt-research/salsa` | ASR-LLM synchronous coupling |

### 8.2 Config Files

| Config | Path | Purpose |
|---|---|---|
| FastConformer AED | `examples/asr/conf/speech_multitask/fast-conformer_aed.yaml` | Canary-style config (reference) |
| FastConformer CTC | `examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml` | CTC encoder training |
| FastConformer RNNT | `examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml` | Transducer training |
| Hybrid CTC+RNNT Prompt | `examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml` | Hybrid with prompt support |
| Long-form CTC | `examples/asr/conf/fastconformer/fast-conformer-long_ctc_bpe.yaml` | Longformer attention for long audio |

### 8.3 Key Papers

- [Canary-1B-v2 & Parakeet-TDT-0.6B-v3 Technical Report](https://arxiv.org/html/2509.14128v2) — NVIDIA's training methodology
- [SALSA: Speedy ASR-LLM Synchronous Aggregation](https://arxiv.org/abs/2408.16542) — ASR+LLM coupling via projection layers
- [SKIP-SALSA](https://www.isca-archive.org/interspeech_2025/mittal25_interspeech.pdf) — Handles token fertility gap for Indic languages
- [Bridging the Gap: Speech-LLM vs E2E ASR](https://arxiv.org/html/2601.01461v1) — Two-stage training beats tri-stage
- [Polyglot-Lion](https://arxiv.org/html/2603.16184v1) — Balanced multilingual sampling
- [Indic CL-ASR](https://arxiv.org/abs/2508.06280) — Continual learning for Indian language ASR
- [Slam: Training SLMs on One GPU in a Day](https://arxiv.org/html/2502.15814v1) — Efficient speech-LLM training
- [Vistaar/IndicWhisper](https://arxiv.org/abs/2305.15386) — Indic training data and benchmarks

### 8.4 NeMo Installation

```bash
# Install AI4Bharat's NeMo fork (for IndicConformer compatibility)
git clone https://github.com/AI4Bharat/NeMo.git
cd NeMo && git checkout nemo-v2 && bash reinstall.sh

# Or install official NeMo (for Canary-style training)
pip install "nemo_toolkit[asr]"

# For Speech-LLM integration with Gemma
pip install transformers accelerate deepspeed
```
