# Hindi Nemotron ASR -- Streaming Speech Recognition

A production real-time Hindi ASR service built on NVIDIA Nemotron-0.6B, fine-tuned for Hindi
and deployed on Modal with a TensorRT-optimized encoder.

## Current Production Deployment

| Setting | Value |
|:---|:---|
| App | `hindi-nemotron-asr` |
| URL | `https://mayaresearch--hindi-nemotron-asr-nemotronasr-webapp.modal.run` |
| Frontend | `https://mayaresearch--hindi-nemotron-asr-webserver-web.modal.run` |
| Deploy file | `nemotron_asr/nemotron_asr_trt.py` |
| GPU | L40S ($1.60/hr) |
| Encoder | TensorRT FP16 (2.84x over PyTorch eager) |
| Model | `BayAreaBoys/nemotron-hindi` (fine-tuned Nemotron-0.6B) |
| Pipeline | Cache-aware RNNT streaming (NeMo) |
| target_inputs | 200 (per-container steady-state concurrency target) |
| max_inputs | 400 (per-container hard cap) |
| Containers | 2 min, 1 buffer, 5 max |

### Deploy

```bash
source venv/bin/activate
modal deploy nemotron_asr/nemotron_asr_trt.py
```

First container startup takes ~5 min (ONNX export + TRT engine build). The engine is cached
to the Modal Volume -- subsequent starts load in ~90s.
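The cold-start vs. warm-start split comes down to a load-or-build check against the Volume. A minimal sketch of that pattern — `build_fn` and the engine filename are illustrative, not the deploy file's actual names:

```python
from pathlib import Path

def load_or_build_engine(cache_dir: str, build_fn,
                         engine_name: str = "encoder_fp16.plan") -> bytes:
    """Return the serialized TRT engine, building it only on a cache miss.

    `build_fn` is a zero-arg callable producing the engine bytes
    (in production this is the slow ONNX-export + TRT-build path).
    """
    engine_path = Path(cache_dir) / engine_name
    if engine_path.exists():
        # Warm start (~90s path): engine already on the Modal Volume.
        return engine_path.read_bytes()
    # Cold start (~5 min path): build once, then persist for future containers.
    engine = build_fn()
    engine_path.parent.mkdir(parents=True, exist_ok=True)
    engine_path.write_bytes(engine)
    return engine
```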

### Endpoints

- `GET /health` -- health check, returns GPU, VRAM, encoder type, active streams
- `WS /ws` -- production WebSocket streaming (real-time audio in, transcription out)
- `POST /transcribe` -- HTTP batch transcription (JSON base64 or multipart file upload)
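A JSON base64 body for `POST /transcribe` can be assembled as below. The field names (`audio`, `sample_rate`) are assumptions for illustration — check `nemotron_asr_trt.py` for the actual schema:

```python
import base64
import json

def build_transcribe_payload(pcm_bytes: bytes, sample_rate: int = 16000) -> str:
    """JSON body for the HTTP batch endpoint: base64-encoded raw PCM plus
    its sample rate. Field names here are illustrative assumptions."""
    return json.dumps({
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        "sample_rate": sample_rate,
    })
```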

### WebSocket Protocol

```
Client -> Server: binary audio chunks (raw PCM 16-bit 16kHz mono)
Client -> Server: text "END" to signal end of audio
Server -> Client: JSON {"text": "...", "is_final": bool, "timestamp": float}
Server -> Client: text "END" to confirm stream complete
```
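On the client side, mic samples must be packed into the binary chunk format above. A minimal sketch, assuming float samples in [-1, 1] (e.g. from an AudioWorklet):

```python
import struct

def floats_to_pcm16(samples) -> bytes:
    """Encode float samples in [-1.0, 1.0] as raw little-endian 16-bit PCM,
    the binary chunk format the /ws endpoint expects (16 kHz mono)."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack("<%dh" % len(samples),
                       *(int(s * 32767) for s in clipped))
```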

## Benchmarked Performance (L40S, WebSocket real-time)

Independent benchmark run (Feb 12, 2026) using the FLEURS Hindi test set with WER/CER scoring.

| Concurrency | WER | CER | FT p50 | FT p99 | Throughput | Errors |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 37.5% | 21.6% | 676ms | 676ms | 0.9x | 0 |
| 10 | 19.9% | 9.5% | 990ms | 1016ms | 6.4x | 0 |
| 50 | 33.7% | 19.1% | 601ms | 649ms | 23.5x | 0 |
| 100 | 34.3% | 19.8% | 781ms | 2351ms | 40.8x | 0 |
| 200 | 34.5% | 19.9% | 867ms | 2761ms | 54.0x | 0 |
| 300 | 33.6% | 19.0% | 983ms | 1713ms | 61.4x | 0 |
| 400 | 33.7% | 19.1% | 1023ms | 1859ms | 60.7x | 0 |
| 500 | 35.4% | 20.9% | 1132ms | 6737ms | 67.2x | 0 |

- **Safe zone**: up to c=400 -- WER holds in the 33-34% band from c=50 onward, 0 errors at every level
- **VRAM**: constant 10.4 / 48 GB (22%) regardless of concurrency
- **Cost for 1000 steady-state users**: 5 containers x $1.60 = $8.00/hr
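The container/cost arithmetic behind the last bullet, using `target_inputs` and the GPU rate from the deployment table:

```python
import math

GPU_RATE_PER_HR = 1.60   # L40S hourly rate
TARGET_INPUTS = 200      # steady-state streams per container
MAX_CONTAINERS = 5       # autoscaler cap

def containers_needed(users: int) -> int:
    """Steady-state container count for a given user load, capped at 5."""
    return min(MAX_CONTAINERS, max(1, math.ceil(users / TARGET_INPUTS)))

def hourly_cost(users: int) -> float:
    """Hourly GPU spend at that load."""
    return containers_needed(users) * GPU_RATE_PER_HR
```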

### Autoscaling Behavior

| Total Users | Containers | Per-Container Load | FT p50 | WER |
|---:|---:|---:|---:|---:|
| 1-200 | 1 (+1 idle) | up to 200 | ~870ms | 33-34% |
| 201-400 | 2 | ~200 each | ~870ms | 33-34% |
| 401-600 | 3 | ~200 each | ~870ms | 33-34% |
| 601-1000 | 4-5 | ~200 each | ~870ms | 33-34% |
| Burst 1000-2000 | 5 (at 400 each) | up to 400 | ~1070ms | 33-34% |
| >2000 | 5 (capped) | 400 (rejecting) | -- | fast-fail (WS close 1013) |
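A minimal sketch of the fast-fail behavior in the last row. The gate logic here is illustrative, not the deploy file's actual code; 1013 is the standard "Try Again Later" WebSocket close code (RFC 6455):

```python
class AdmissionGate:
    """Fast-fail new streams once a container hits max_inputs, mirroring
    what the service does when all five containers are saturated."""

    def __init__(self, max_inputs: int = 400):
        self.max_inputs = max_inputs
        self.active = 0

    def try_admit(self):
        """Return (admitted, ws_close_code_or_None)."""
        if self.active >= self.max_inputs:
            return False, 1013  # "Try Again Later"
        self.active += 1
        return True, None

    def release(self):
        """Call when a stream closes so its slot frees up."""
        self.active = max(0, self.active - 1)
```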

## Repo Structure

```
nemo_hindi/
  nemotron_asr/              # Core application (deployed to Modal)
    nemotron_asr_trt.py      # ** PRODUCTION DEPLOY FILE ** -- TRT encoder, Modal app
    nemotron_asr.py          # Legacy torch.compile version (not deployed)
    pipelines.py             # NeMo CacheAwareRNNTStreamingPipeline + builder
    asr_utils.py             # Audio preprocessing (PCM, resampling, ffmpeg)
    multi_stream_utils.py    # Batched request streaming for concurrent sessions
    vad.py                   # CPU-based Silero VAD (silence filtering)
    bench_modal.py           # Benchmark script (WER/CER, concurrency sweep)
  nemotron-asr-frontend/     # Browser frontend (served by Modal WebServer)
    index.html               # Main page
    cache-aware-stt.js       # WebSocket client for streaming ASR
    audio-processor.js       # AudioWorklet for mic capture + PCM encoding
  model_cache/               # Local model files (not deployed -- Modal downloads from HF)
    final_model.nemo         # Fine-tuned Hindi Nemotron checkpoint
  .env                       # Environment variables (HF_TOKEN, etc.)
  venv/                      # Python virtual environment

### Key Files

**`nemotron_asr/nemotron_asr_trt.py`** -- The production deployment file. Contains:
- `TRTEncoderWrapper` -- Drop-in TRT replacement for NeMo conformer encoder.
  Overrides `cache_aware_stream_step()` to route through TRT engine instead of
  PyTorch. FP16 internal computation, FP32 I/O, dynamic batch/time shapes.
- `_export_encoder_onnx()` -- Exports encoder to ONNX with cache-aware I/O at startup.
- `_build_trt_engine()` -- Builds TRT engine with FP16 and dynamic optimization profile.
  Cached to Modal Volume after first build.
- `NemotronASR` -- Modal class with WebSocket streaming, HTTP batch, health check,
  event-driven batching, backpressure, and per-stream state management.
- Falls back to `torch.compile` if TRT build fails.
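The TRT-or-fallback startup decision can be sketched generically; the callable names below are illustrative, not the deploy file's actual API:

```python
def build_encoder(build_trt, build_torch_compile, log=print):
    """Try the TRT path first; fall back to torch.compile on any build
    failure. Both arguments are zero-arg callables returning an encoder
    object. Returns (encoder, backend_label)."""
    try:
        return build_trt(), "trt_fp16"
    except Exception as exc:  # e.g. ONNX export or TRT parse failure
        log(f"TRT build failed ({exc!r}); falling back to torch.compile")
        return build_torch_compile(), "torch_compile"
```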

**`nemotron_asr/pipelines.py`** -- NeMo streaming pipeline adapted for multi-stream
concurrent inference. `CacheAwareRNNTStreamingPipeline` handles frame batching,
context management, greedy RNNT decoding, and endpointing.
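The frame-batching idea -- pull one fixed-size frame from every stream that has enough buffered audio, then run them as a single batch -- can be sketched as follows (a simplification of the pipeline's actual bookkeeping):

```python
def batch_ready_streams(streams, frame_len: int):
    """Collect one frame from every stream with enough buffered audio,
    returning (stream_ids, frames) for one batched encoder call.

    `streams` maps stream_id -> bytearray of pending PCM; consumed bytes
    are removed so the next call resumes where this one left off.
    """
    ids, frames = [], []
    for sid, buf in streams.items():
        if len(buf) >= frame_len:
            frames.append(bytes(buf[:frame_len]))
            del buf[:frame_len]  # advance this stream's buffer
            ids.append(sid)
    return ids, frames
```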

**`nemotron_asr/bench_modal.py`** -- Benchmark tool. Runs WebSocket or HTTP tests at
configurable concurrency levels with WER/CER quality scoring against FLEURS Hindi.

```bash
# Example: sweep concurrency 1 to 500
python nemotron_asr/bench_modal.py \
  --url https://mayaresearch--hindi-nemotron-asr-nemotronasr-webapp.modal.run \
  --mode ws --concurrency 1,10,50,100,200,300,400,500
```
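The WER the benchmark reports is word-level edit distance divided by reference length; a minimal reference implementation of that scoring (not `bench_modal.py`'s actual code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(1, len(ref))
```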

## Architecture

```
Browser/Client
    |
    | WebSocket (PCM 16kHz mono)
    v
Modal Container (L40S + 8 CPU)
    |
    +-- recv_loop: receive audio chunks, buffer, VAD gate, append to pipeline
    |
    +-- centralized_inference_loop (event-driven batching):
    |       |
    |       +-- pipeline.process_streaming_batch()
    |       |       |
    |       |       +-- CacheFeatureBufferer (mel spectrogram)
    |       |       +-- TRTEncoderWrapper.cache_aware_stream_step()
    |       |       |       |
    |       |       |       +-- TRT FP16 engine (24-layer conformer)
    |       |       |       +-- streaming_post_process (cache truncation)
    |       |       |
    |       |       +-- RNNT greedy decoder
    |       |       +-- BPE decoder + endpointing
    |       |
    |       +-- route_outputs() -> per-stream queues
    |
    +-- send_loop: drain queue, send JSON transcription to client
```
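The event-driven batching at the heart of this diagram -- sleep until new audio arrives, run one batched GPU step over every ready stream, fan results back out -- can be sketched with `asyncio` (function and argument names are illustrative, not the deploy file's API):

```python
import asyncio

async def inference_loop(pending, wake, process_batch, route_output, stop):
    """Event-driven batching loop.

    `pending`: dict stream_id -> list of frames appended by recv_loop.
    `wake`: asyncio.Event set by recv_loop when new audio is buffered.
    `process_batch`: batched model step (stand-in for
        pipeline.process_streaming_batch); dict of frames -> dict of results.
    `route_output`: delivers (stream_id, result) to that stream's send queue.
    `stop`: asyncio.Event signaling shutdown.
    """
    while not stop.is_set():
        await wake.wait()   # sleep until recv_loop signals new audio
        wake.clear()
        # Snapshot ready streams so recv_loop can keep appending concurrently.
        batch = {sid: frames.pop(0) for sid, frames in pending.items() if frames}
        if not batch:
            continue
        results = process_batch(batch)  # one GPU call for all streams
        for sid, result in results.items():
            route_output(sid, result)
```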
