---
name: ASR Benchmark Improvements
overview: "Address 4 gaps identified in the current benchmarks: missing WER quality metrics, explore TensorRT encoder export, optimize GPU selection for cost/VRAM efficiency, and eliminate Modal tunnel overhead where possible. Remove SSE from all testing."
todos:
  - id: wer-bench
    content: Add jiwer WER/CER computation to bench_modal.py, match transcriptions to ground truth, remove SSE mode
    status: completed
  - id: wer-run
    content: Re-run WS and HTTP benchmarks on H100 with WER measurement at each concurrency level
    status: completed
  - id: cleanup-modal
    content: Remove dead modal.forward() tunnel + uvicorn thread + SSE endpoint from nemotron_asr.py
    status: in_progress
  - id: add-cpu
    content: Add cpu=4.0 to @app.cls() decorator, redeploy, measure TTFT improvement at high concurrency
    status: pending
  - id: bench-l40s
    content: Deploy on L40S, run WS+HTTP benchmark sweep with WER, compare cost/stream vs H100
    status: pending
  - id: bench-l4
    content: Deploy on L4, run WS+HTTP benchmark sweep with WER, find capacity floor
    status: pending
  - id: trt-export
    content: Export encoder subnet to ONNX via model.export(), validate output correctness
    status: pending
  - id: trt-engine
    content: Convert ONNX encoder to TensorRT engine, integrate into streaming pipeline, benchmark
    status: pending
---

# ASR Benchmark: WER, TensorRT, GPU Selection, Tunnel Fix

## Analysis of the 4 Questions

### Q1: Why no TensorRT yet?

**Short answer**: RNNT streaming + TensorRT is non-trivial because of autoregressive decoding and cache state management.

The model IS Exportable. `EncDecRNNTBPEModel` inherits from `Exportable` and exposes `.export()`, `.to_onnx()`, and `list_export_subnets()` (encoder + decoder_joint). However:

- The **encoder** (FastConformer, 24 layers) is the compute bottleneck and IS a good TensorRT target — each chunk is a single non-autoregressive forward pass.
- The **RNNT decoder** is autoregressive (each token depends on the previous one) — TensorRT optimization here is limited.
- The **cache-aware streaming** pipeline in `pipelines.py` deeply couples cache state management (context_manager, state_pool, bufferer) with the encoder forward pass. Swapping in a TensorRT encoder means rewriting `cache_aware_transcribe_step()` in `pipelines.py` (lines 705-773) to handle the engine I/O.

**What's feasible**: Export ONLY the encoder to ONNX, then to TensorRT. Keep the RNNT decoder in PyTorch. This is exactly what NVIDIA's Riva does internally and would yield ~2x encoder speedup.
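The encoder-only export can go through NeMo's `Exportable` API directly. A sketch (the checkpoint path and output filename are placeholders, and the per-subnet file splitting should be validated against the installed NeMo version):

```python
# Sketch: export the FastConformer encoder subnet to ONNX, keeping the
# RNNT decoder in PyTorch. Checkpoint/output names are placeholders.
import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("model.nemo")
model.eval()

# Confirm which subnets the model exposes before exporting.
print(model.list_export_subnets())  # expected: encoder, decoder_joint

# NeMo writes multi-subnet models out as per-subnet ONNX files; only
# the encoder file goes on to TensorRT.
with torch.inference_mode():
    model.export("model.onnx", check_trace=True)
```

Validating the exported encoder against the PyTorch encoder on a few real audio chunks (max absolute difference on the output tensor) is the cheap correctness gate before any TensorRT work.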

**What the NVIDIA Modal reference does**: The [modal-parakeet](https://github.com/modal-projects/modal-parakeet) repo uses NeMo Python pipeline directly — NO TensorRT. Their blog's "560 streams" number comes from internal Riva/NIM, not Modal.

### Q2: Modal Tunnel Latency

**Key finding**: Our benchmarks already bypass the tunnel. The URL `https://mayaresearch--hindi-nemotron-asr-nemotronasr-webapp.modal.run` is served via `@modal.asgi_app()`, which is Modal's native ASGI hosting — NOT the `modal.forward(8000)` tunnel.

The current code runs BOTH paths redundantly:

- `@modal.asgi_app()` at `webapp()` (line 1260) — Modal's native ASGI, used by benchmarks
- `modal.forward(8000)` + uvicorn thread (lines 865-874) — separate tunnel, only used by the frontend

**The tunnel code is dead weight for benchmarks.** We should remove the uvicorn thread + `modal.forward()` entirely and serve everything through `@modal.asgi_app()`.

The ~400ms TTFT gap vs NVIDIA's 182ms is from:

- No TensorRT (~2x slower per frame) — **this is the main factor**
- Modal's proxy/CDN layer (~50-100ms per request)
- Python asyncio event loop overhead at high concurrency

There is **no way to get direct IP access on Modal** — it's a managed platform. The proxy is fundamental.

**What WILL help**: Increasing CPU cores from the default 0.125 to 4-8 physical cores. Our async event loop handles hundreds of WebSocket connections, audio preprocessing, and queue routing — all CPU work. Default 0.125 cores means we're CPU-throttled at high concurrency. Cost: ~$0.048/hr per core, negligible vs H100 at $3.95/hr.

### Q3: VRAM and GPU Selection

The model uses **10.5GB constant** regardless of concurrency (1 or 400 streams). This is by design — cache-aware streaming reuses hidden states in-place.

**Multiple replicas on one GPU**: Great idea but NOT possible on Modal (one GPU per container). On Vast.ai/bare metal, you could run 4-6 replicas on H100 (4x10.5 = 42GB, 52% utilized). On Modal, the answer is: **pick a smaller GPU**.

**GPU comparison (compute-bound workload, 10.5GB model)**:

- **L4 (24GB, 121 TFLOPS)**: 44% VRAM used. ~50 streams estimated. $0.59/hr. Best VRAM utilization.
- **L40S (48GB, 362 TFLOPS)**: 22% VRAM used. ~145 streams estimated. $1.60/hr. Best cost/stream ratio for medium scale.
- **A100-40GB (40GB, 312 TFLOPS)**: 26% VRAM used. ~125 streams. $2.78/hr. Middle ground.
- **H100 (80GB, 990 TFLOPS)**: 13% VRAM used. ~400 streams proven. $3.95/hr. Raw throughput king.

**Per-stream cost analysis**:

- L4: $0.59/50 = **$0.012/stream-hr** (44% VRAM util)
- L40S: $1.60/145 = **$0.011/stream-hr** (22% VRAM util)
- H100: $3.95/400 = **$0.010/stream-hr** (13% VRAM util)

All three are within 20% per-stream cost. The difference is operational: L4 needs more containers for the same total capacity.
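The per-stream numbers above are a one-liner to recompute as benchmark data comes in (prices and stream capacities below are this doc's estimates, not measurements):

```python
# Per-stream cost: hourly GPU price divided by sustained stream capacity.
# Figures are the estimates from this doc, to be replaced by measured data.
gpus = {
    "L4":   {"price_hr": 0.59, "streams": 50},
    "L40S": {"price_hr": 1.60, "streams": 145},
    "H100": {"price_hr": 3.95, "streams": 400},
}

for name, g in gpus.items():
    per_stream = g["price_hr"] / g["streams"]
    print(f"{name}: ${per_stream:.4f}/stream-hr")
```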

**TensorRT would NOT increase VRAM usage** — it typically reduces it. The only way to get 70-80% VRAM is multi-replica (impossible on Modal) or a much larger model (3B+).

**Recommendation**: Benchmark L40S and L4 alongside H100 to give the user actual data for the cost/VRAM tradeoff decision.

### Q4: Missing WER Scores

This is the most critical gap. We benchmarked latency/throughput but never measured quality degradation vs concurrency. We have ground truth text in `benchmark_data/fleurs_hindi/manifest.json` and our benchmark captures transcription text — we just never computed WER.

The quality degradation curve is the MOST important output: "at what concurrency does WER degrade beyond acceptable?"

## Execution Plan

### Phase 1: Add WER to benchmarks and re-run (highest priority)

- Add `jiwer` WER/CER computation to `bench_modal.py`
- Match each benchmark stream to its ground truth from the manifest
- Re-run WS and HTTP benchmarks on H100 with WER at each concurrency level
- Remove SSE mode from the benchmark script entirely
- This gives us the quality degradation curve: concurrency vs WER
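`jiwer` is the dependency the plan adds; under the hood, word-level WER is just edit distance over word sequences divided by reference length. A minimal stdlib equivalent, useful for sanity-checking the `jiwer` numbers (illustrative only; `jiwer` additionally handles text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

One aggregation caveat for the concurrency sweep: compute corpus-level WER (total errors over total reference words across all streams at a given concurrency), not a mean of per-utterance WERs, so short utterances don't dominate the curve.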

### Phase 2: Clean up Modal deployment and add CPU cores

- Remove the dead `modal.forward()` + uvicorn thread from `nemotron_asr.py`
- Remove SSE endpoint
- Add `cpu=4.0` to the `@app.cls()` decorator (default 0.125 is absurdly low for 400 WS connections)
- Redeploy and re-benchmark to measure CPU impact on TTFT at high concurrency
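The decorator change itself is small. A sketch of the intended shape (app/class names here are illustrative, not the real ones in `nemotron_asr.py`; this is deployment config, not runnable standalone):

```python
import modal

app = modal.App("hindi-nemotron-asr")

@app.cls(
    gpu="H100",
    cpu=4.0,  # up from the 0.125-core default; the asyncio loop,
              # audio preprocessing, and queue routing are all CPU work
)
class NemotronASR:
    @modal.asgi_app()  # single serving path; modal.forward() tunnel removed
    def webapp(self):
        ...
```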

### Phase 3: Benchmark on L40S and L4

- Deploy same app with `gpu="L40S"` and `gpu="L4"` (separate Modal apps or GPU fallbacks)
- Run WS+HTTP benchmark sweep on each
- Compare: streams per GPU, WER vs concurrency, VRAM utilization, cost per stream

### Phase 4: TensorRT encoder export (experimental)

- Export encoder subnet to ONNX via `model.export()`
- Convert ONNX to TensorRT engine via `trtexec` (install in Modal image)
- Create a thin wrapper that replaces `asr_model.stream_step()` encoder call with TensorRT engine call
- This requires modifying `cache_aware_transcribe_step()` in `pipelines.py` to split the encoder/decoder paths
- Test locally first, then deploy to Modal
- Expected: ~2x encoder speedup, pushing H100 capacity toward 600-800 streams
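The ONNX-to-engine step can be driven by `trtexec`. A sketch assuming a dynamic-length feature input (the tensor name and shape ranges below are placeholders; read the real input names and dims off the exported ONNX graph before building):

```shell
# Build an FP16 TensorRT engine from the exported encoder.
# Input name "audio_signal" and the BxFxT shape ranges are placeholders;
# inspect the ONNX graph for the actual values first.
trtexec \
  --onnx=encoder.onnx \
  --saveEngine=encoder.plan \
  --fp16 \
  --minShapes=audio_signal:1x80x16 \
  --optShapes=audio_signal:64x80x160 \
  --maxShapes=audio_signal:400x80x160
```

The max batch dimension should match the target concurrency ceiling (400 streams on H100), since an engine built with a smaller max profile cannot serve larger batches.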

**Note**: Phase 4 is the riskiest — RNNT cache state management is tightly coupled to the encoder. If the ONNX export doesn't handle the cache tensors correctly, we may need to restructure the pipeline significantly. We should validate the ONNX export works before committing to full TensorRT integration.