---
name: ASR Latency Diagnosis Fix
overview: The 539ms WebSocket first-token latency is not a code bug — it is a direct consequence of `att_context_size = [70, 6]`, which requires 560ms of audio before the model can produce any output. Switching to `[70, 1]` (80ms chunks) drops theoretical FT to ~130ms at the cost of a 0.62% absolute WER increase (7.84% vs 7.22%).
todos:
  - id: chunk-mode
    content: Change att_context_size default from [70, 6] to [70, 1] in config.py with env var override
    status: pending
  - id: batch-latency
    content: Reduce max_batch_latency_ms from 50 to 10 in config.py
    status: pending
  - id: timing
    content: Add per-frame timing instrumentation in engine.py (gated behind DEBUG_TIMING env var)
    status: pending
  - id: rebench
    content: Restart server with new config, run full WS benchmark sweep, compare FT numbers
    status: pending
  - id: readme
    content: Update README with new benchmark results and latency breakdown
    status: pending
---

# ASR Latency Root Cause and Fix Plan

## Root Cause: The 560ms Audio Buffering Wall

The entire first-token latency budget is consumed by **waiting for enough audio to fill one model frame**.

### The math (current: `att_context_size = [70, 6]`)

```
frame_bytes = (att_context_size[1] + 1) * 1280 * 2
            = (6 + 1) * 1280 * 2
            = 17,920 bytes
            = 560ms of 16kHz 16-bit audio
```
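The formula above can be packaged as a small helper (the name `frame_geometry` is hypothetical, not from the codebase; it assumes 16kHz 16-bit mono audio, where each right-context unit plus the current step corresponds to 1280 samples, i.e. 80ms):

```python
def frame_geometry(att_context_size, sample_rate=16_000, sample_width=2):
    """Frame size implied by the formula above: (right_context + 1) blocks
    of 1280 samples each, at the given sample rate and byte width."""
    samples = (att_context_size[1] + 1) * 1280
    frame_bytes = samples * sample_width
    frame_ms = samples / sample_rate * 1000
    return frame_bytes, frame_ms

print(frame_geometry([70, 6]))    # -> (17920, 560.0)
print(frame_geometry([70, 13]))   # -> (35840, 1120.0)
```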

The gateway ([gateway.py](triton_asr/gateway.py) line 309) buffers audio until `frame_threshold = engine.frame_bytes` is reached. The NeMo pipeline physically cannot produce output until it has one full frame. So:

```
WS c=1 timeline:
  t=0ms      client sends chunk 1 (160ms, 5120 bytes)
  t=160ms    client sends chunk 2 (5120 bytes, total=10240)
  t=320ms    client sends chunk 3 (5120 bytes, total=15360)
  t=480ms    client sends chunk 4 (5120 bytes, total=20480 > 17920) --> FEED TO ENGINE
  t=480ms    inference loop wakes (0-50ms batch wait)
  t=535ms    GPU forward pass completes (~55ms)
  t=536ms    output routed to queue, sent to client
  ----
  Observed FT: ~539ms   (480ms buffering + ~55ms batch wait + GPU + ~4ms routing)
```

**~480ms of the 539ms first-token latency is just waiting for audio to accumulate.** The GPU forward pass itself is only ~55ms. The compute is not the bottleneck; the chunk size configuration is.
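The buffering wall in the timeline can be reproduced with a tiny simulation (the function name is hypothetical; it assumes the 160ms/5120-byte client chunks and the 17,920-byte `frame_threshold` described above):

```python
def first_feed_time_ms(chunk_bytes=5120, chunk_interval_ms=160,
                       frame_threshold=17_920):
    """Wall-clock time at which the gateway's buffer first reaches
    frame_threshold, given a fixed client send cadence."""
    buffered, t = 0, 0
    while True:
        buffered += chunk_bytes          # chunk arrives at time t
        if buffered >= frame_threshold:  # threshold crossed -> feed engine
            return t
        t += chunk_interval_ms           # wait for the next client chunk

print(first_feed_time_ms())  # -> 480
```

Everything before t=480ms is pure waiting: no amount of server-side optimization can recover it while the frame threshold stays at 17,920 bytes.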

### Nemotron supported chunk modes (no retraining needed)

The model supports four latency modes, selected purely by changing `att_context_size` at inference time:

| att_context_size | Chunk | Min FT (theory) | WER (English) | Delta |
|:---|---:|---:|---:|:---|
| [70, 1] | 80ms | ~130ms | 7.84% | baseline |
| [70, 2] | 160ms | ~210ms | 7.84% | same WER |
| **[70, 6]** (current) | **560ms** | **~590ms** | **7.22%** | **-0.62%** |
| [70, 13] | 1120ms | ~1170ms | 7.16% | -0.68% |

Switching from [70, 6] to [70, 1] costs **0.62% WER** and buys **~430ms** of first-token latency. For a voice agent, this is not a tradeoff — it is mandatory.

### HTTP path analysis

The scheduler-routed HTTP path (v2) shows 1.5s at c=1 for 5s audio:

- 5s audio / 0.56s frame = ~9 frames
- Each frame: up to 50ms batch wait + 55ms GPU = 105ms
- 9 frames * 105ms = ~945ms inference + overhead = ~1.5s

With 80ms chunks: 5s / 0.08s = ~63 frames, but GPU time per frame drops to ~8-10ms (smaller frames mean less computation), so the total is roughly 63 * 15ms = ~945ms. Total GPU time is similar, but the per-frame batch wait now matters much more. Solution: for bulk HTTP feeds, the batcher should drain consecutive frames without re-waiting between them (the frames are already buffered).

## The Fix (3 changes)

### Change 1: Switch to 80ms chunks (the big win)

In [config.py](triton_asr/config.py) line 33, change the default and add env var override:

```python
att_context_size: List[int] = field(
    default_factory=lambda: [70, int(os.environ.get("ATT_RIGHT_CONTEXT", "1"))]
)
```

This single change drops WS first-token from ~539ms to ~130ms.

### Change 2: Reduce max_batch_latency_ms for low concurrency

Currently 50ms. With 80ms chunks, the batch wait becomes a larger fraction of per-frame time. Reduce to 10ms:

```python
max_batch_latency_ms: float = float(os.environ.get("MAX_BATCH_LATENCY_MS", "10"))
```

At c=1 this saves ~40ms. At high concurrency the event signal fires immediately anyway (0ms wait), so this only affects low-load latency.

### Change 3: Add per-frame timing instrumentation

Add timing logs inside the inference loop and gateway so we can verify the fix and identify any remaining bottlenecks:

- `t_buffered`: when gateway feeds audio to engine
- `t_batch_start`: when inference loop picks up the batch
- `t_gpu_done`: after pipeline.process_streaming_batch()
- `t_routed`: after output reaches client queue

This should be lightweight (perf_counter calls) and gated behind a `DEBUG_TIMING` env var.
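A minimal sketch of such a timer (the `FrameTimer` class and logger name are assumptions, not existing code; the env var is checked lazily so it can be flipped at runtime):

```python
import logging
import os
import time

log = logging.getLogger("asr.timing")

def _timing_enabled() -> bool:
    # Checked per call so DEBUG_TIMING can be toggled without a restart.
    return os.environ.get("DEBUG_TIMING", "0") == "1"

class FrameTimer:
    """Collects the checkpoints listed above for one frame's lifecycle."""
    __slots__ = ("marks",)

    def __init__(self):
        self.marks = {}

    def mark(self, name: str) -> None:
        if _timing_enabled():  # no-op (one dict lookup) when disabled
            self.marks[name] = time.perf_counter()

    def report(self) -> None:
        if not _timing_enabled() or "t_buffered" not in self.marks:
            return
        t0 = self.marks["t_buffered"]
        log.info("frame timing (ms since t_buffered): %s",
                 {k: round((v - t0) * 1e3, 2) for k, v in self.marks.items()})
```

In the inference loop this would look like `timer.mark("t_batch_start")` just before the batch is formed and `timer.report()` after the output is routed.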

## Expected results after fix

| Metric | Before (560ms chunks) | After (80ms chunks) | Change |
|:---|---:|---:|:---|
| WS FT c=1 | 539ms | ~130ms | -76% |
| WS FT c=50 | 694ms | ~200ms | -71% |
| WS FT c=100 | 1209ms | ~350ms | -71% |
| HTTP c=1 (5s audio) | 1481ms | ~900ms | -39% |
| WER | 7.22% | 7.84% | +0.62% |
| VRAM | 10.4GB | ~10.4GB | unchanged |

Note: GPU inference per frame will be faster (smaller tensor), so concurrency capacity may actually increase.