---
name: Production Polish and Deploy
overview: "Polish the Triton ASR server for production: fix the 560ms latency wall (chunk size), add GPU/PyTorch telemetry, add adaptive batch wait, implement chaos tests, run a sweep grid to find the per-GPU sweet spot, then deploy to Vast.ai Serverless using their PyWorker/WorkerConfig pattern with the existing production/worker.py as reference."
todos:
  - id: p1-chunk
    content: "Phase 1: Change att_context_size default to [70,1], reduce batch_latency to 10ms, add adaptive batch wait"
    status: in_progress
  - id: p2-telem
    content: "Phase 2: Add PyTorch allocator metrics + event loop lag to engine and prometheus endpoint"
    status: pending
  - id: p3-chaos
    content: "Phase 3: Implement test_chaos.py with 4 failure scenarios (slow consumer, abandon, churn, malformed)"
    status: pending
  - id: p4-sweep
    content: "Phase 4: Run sweep grid (chunk_size x batch_wait x concurrency), produce CSV with SLO pass/fail"
    status: pending
  - id: p5-vast-worker
    content: "Phase 5a: Create worker.py + onstart.sh for Vast PyWorker integration"
    status: pending
  - id: p5-vast-docker
    content: "Phase 5b: Update Dockerfile to bake model, copy triton_asr code, expose correct ports"
    status: pending
  - id: p5-vast-deploy
    content: "Phase 5c: Deploy to Vast Serverless, run smoke + stress tests through /route/"
    status: pending
---

# Production Polish, Latency Fix, Telemetry, and Vast.ai Deployment

## Phase 1: Fix the 560ms Latency Wall

The root cause is confirmed: `att_context_size = [70, 6]` forces 560ms of audio buffering before the model can emit a single token. The model natively supports `[70, 0]` (80ms) through `[70, 13]` (1.12s) with only 0.6% WER cost at the fastest setting.

### Changes

**[config.py](triton_asr/config.py)** line 33 -- make chunk mode configurable via env var:
```python
att_context_size: List[int] = field(
    default_factory=lambda: [70, int(os.environ.get("ATT_RIGHT_CONTEXT", "1"))]
)
```
Default to `[70, 1]` (160ms chunks). Expose `ATT_RIGHT_CONTEXT` env var for per-deployment tuning.
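
The chunk sizes quoted throughout this plan follow directly from the right-context setting; a minimal sketch of the mapping (`FRAME_MS` matches the model's 80ms frame granularity stated above):

```python
FRAME_MS = 80  # one encoder frame at the model's 80 ms granularity

def chunk_ms(right_context: int) -> int:
    """Audio buffered before the first token can be emitted: (r + 1) frames."""
    return (right_context + 1) * FRAME_MS
```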

**[config.py](triton_asr/config.py)** -- reduce `max_batch_latency_ms` default from 50 to 10:
```python
max_batch_latency_ms: float = float(os.environ.get("MAX_BATCH_LATENCY_MS", "10"))
```
At 80ms chunk size, a 50ms batch wait is 63% of one chunk period. 10ms caps batch overhead at 12.5%.

### Adaptive batch wait (engine.py)

Replace the fixed `max_wait_s` in the inference loop with a function of active stream count:
- `active < 10`: 5ms (single-digit streams = latency-sensitive)
- `active < 50`: 15ms
- `active >= 50`: 40ms (many streams = throughput-sensitive, batch fills fast anyway)

This is ~6 lines in `_inference_loop()`.
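
A sketch of those ~6 lines as a standalone helper (thresholds and the function name are from this plan, not the existing code):

```python
def batch_wait_s(active_streams: int) -> float:
    """Adaptive batcher wait: few streams bias toward latency,
    many streams bias toward throughput (the batch fills fast anyway)."""
    if active_streams < 10:
        return 0.005   # 5 ms: single-digit streams are latency-sensitive
    if active_streams < 50:
        return 0.015   # 15 ms
    return 0.040       # 40 ms: throughput-sensitive regime
```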

### Expected first-token after fix

| Concurrency | Before (560ms chunks) | After (160ms chunks, 10ms wait) |
|---:|---:|---:|
| 1 | 539ms | ~170ms |
| 50 | 694ms | ~230ms |
| 100 | 1209ms | ~350ms |

---

## Phase 2: GPU + PyTorch Telemetry

Add real observability to `/metrics` and `/metrics/prometheus`.

### PyTorch allocator metrics ([engine.py](triton_asr/engine.py))

Add a `gpu_telemetry()` method that returns:
- `vram_allocated_bytes` -- `torch.cuda.memory_allocated()`
- `vram_reserved_bytes` -- `torch.cuda.memory_reserved()`
- `vram_peak_allocated_bytes` -- `torch.cuda.max_memory_allocated()`
- `cuda_alloc_retries` -- from `torch.cuda.memory_stats()["num_alloc_retries"]`
- `cuda_ooms` -- from `torch.cuda.memory_stats()["num_ooms"]`
- `reserved_minus_allocated` -- fragmentation proxy
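
A sketch of the method, assuming a single-GPU worker (device 0) and degrading to an empty dict on CPU-only hosts:

```python
try:
    import torch
    _CUDA = torch.cuda.is_available()
except ImportError:
    _CUDA = False

def gpu_telemetry() -> dict:
    """Allocator snapshot for /metrics; empty on hosts without CUDA."""
    if not _CUDA:
        return {}
    stats = torch.cuda.memory_stats()
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    return {
        "vram_allocated_bytes": allocated,
        "vram_reserved_bytes": reserved,
        "vram_peak_allocated_bytes": torch.cuda.max_memory_allocated(),
        "cuda_alloc_retries": stats.get("num_alloc_retries", 0),
        "cuda_ooms": stats.get("num_ooms", 0),
        "reserved_minus_allocated": reserved - allocated,  # fragmentation proxy
    }
```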

### Event loop lag metric ([engine.py](triton_asr/engine.py))

Add a simple periodic check: schedule a timed callback, measure how late it actually fires. Expose `event_loop_lag_ms` in metrics. If this climbs, blocking work is starving the event loop and every "async" latency number downstream is suspect.
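
A minimal sketch of one lag sample; in the engine this would run as a background task that refreshes the metric every interval:

```python
import asyncio
import time

async def measure_loop_lag_ms(interval_s: float = 0.05) -> float:
    """One lag sample: how late does a timed sleep actually wake up?"""
    start = time.monotonic()
    await asyncio.sleep(interval_s)
    return max((time.monotonic() - start - interval_s) * 1000.0, 0.0)

async def lag_monitor(metrics: dict, interval_s: float = 0.5) -> None:
    """Background task: keep metrics['event_loop_lag_ms'] fresh."""
    while True:
        metrics["event_loop_lag_ms"] = await measure_loop_lag_ms(interval_s)
```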

### Prometheus exposition ([gateway.py](triton_asr/gateway.py))

Add the new metrics to the `/metrics/prometheus` endpoint as gauges/counters.
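
A sketch of the exposition logic (the `asr_` prefix and the counter/gauge split are illustrative choices, not the existing endpoint's naming):

```python
def to_prometheus(telemetry: dict) -> str:
    """Render a flat telemetry dict in Prometheus text exposition format."""
    COUNTERS = {"cuda_alloc_retries", "cuda_ooms"}  # monotonic metrics
    lines = []
    for name, value in sorted(telemetry.items()):
        mtype = "counter" if name in COUNTERS else "gauge"
        lines.append(f"# TYPE asr_{name} {mtype}")
        lines.append(f"asr_{name} {value}")
    return "\n".join(lines) + "\n"
```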

---

## Phase 3: Chaos / Failure Tests

Add [test_chaos.py](triton_asr/test_chaos.py) with 4 scenarios:

- **Slow consumer**: client reads 1 msg/sec while server produces 20/sec. Assert: bounded queue holds, partials drop, finals delivered, no crash.
- **Abandon**: client connects, sends 2 chunks, then goes silent (no END). Assert: idle timeout triggers within `ws_idle_timeout_s`, stream cleaned up, VRAM stable.
- **Churn storm**: 200 rapid connect/disconnect cycles in 30s. Assert: `active_streams` returns to 0, no VRAM growth, no state leak.
- **Malformed frames**: send random bytes, wrong-size frames, text where binary expected. Assert: server rejects gracefully, no engine crash, error metric increments.
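
The slow-consumer scenario hinges on one server behavior: a bounded per-stream queue that drops partials under backpressure but always delivers the final. A self-contained sketch of that test pattern (queue size, message shapes, and the drain-before-final policy are illustrative):

```python
import asyncio

QUEUE_MAX = 8

async def slow_consumer_scenario():
    """Fast producer vs. slow reader: partials may drop, the final must land."""
    q = asyncio.Queue(maxsize=QUEUE_MAX)
    dropped = 0

    async def server():
        nonlocal dropped
        for i in range(100):
            if q.full():
                dropped += 1              # bounded queue: drop, never block
            else:
                q.put_nowait({"type": "partial", "seq": i})
            await asyncio.sleep(0)
        while q.full():                   # evict stale partials so the
            q.get_nowait()                # final always fits
            dropped += 1
        q.put_nowait({"type": "final"})

    async def client():
        received = []
        while True:
            msg = await q.get()
            await asyncio.sleep(0.001)    # deliberately slow reader
            received.append(msg)
            if msg["type"] == "final":
                return received

    server_task = asyncio.create_task(server())
    received = await client()
    await server_task
    return received, dropped
```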

---

## Phase 4: Sweep Grid (chunk size x batch wait x concurrency)

Run a structured sweep to find the per-GPU sweet spot. Script: [test_sweep.py](triton_asr/test_sweep.py).

Grid:
- `ATT_RIGHT_CONTEXT`: 0, 1, 6 (80ms, 160ms, 560ms)
- `MAX_BATCH_LATENCY_MS`: 5, 10, 30
- Concurrency: 1, 10, 50, 100, 150

For each cell, record:
- WS TTFT p50/p95
- Drift p99
- Throughput (x realtime)
- GPU metrics (vram, alloc retries)
- Error count

Output: single CSV + JSON. Define SLO pass/fail: `TTFT p95 < 300ms AND errors = 0 AND drift p99 < 1000ms`.

The highest concurrency that passes all SLOs at the best chunk/wait combo = the sweet spot.
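
The SLO gate and sweet-spot selection are mechanical once the CSV exists; a sketch, with column names assumed rather than taken from the existing harness:

```python
from typing import Optional

def slo_pass(row: dict) -> bool:
    """SLO gate from the plan: TTFT p95 < 300ms, zero errors, drift p99 < 1s."""
    return (
        row["ttft_p95_ms"] < 300.0
        and row["errors"] == 0
        and row["drift_p99_ms"] < 1000.0
    )

def sweet_spot(rows: list) -> Optional[dict]:
    """Highest-concurrency cell that passes every SLO, or None."""
    passing = [r for r in rows if slo_pass(r)]
    return max(passing, key=lambda r: r["concurrency"]) if passing else None
```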

---

## Phase 5: Vast.ai Serverless Deployment

### Architecture on Vast

```
Client --> POST /route/ --> Vast serverless engine --> worker IP:port
Client --> POST worker:8000/v1/audio/transcriptions --> PyWorker --> model server:18000
```

The existing [production/worker.py](production/worker.py) and [production/onstart.sh](production/onstart.sh) already implement this pattern correctly. The triton_asr version needs the same files adapted for the new codebase.

### Files to create in `triton_asr/`

**`worker.py`** -- Vast PyWorker config. Based on [production/worker.py](production/worker.py) but pointing at the triton_asr server. Key elements from the Vast docs:
- `WorkerConfig` with `model_server_port=18000`, `model_log_file=/var/log/model/server.log`
- `HandlerConfig` for `/v1/audio/transcriptions`, `/v1/audio/transcriptions/json`, `/v1/audio/transcriptions/stream`, `/metrics`
- `BenchmarkConfig` with generator that produces 1s silence WAV base64 (same as production/worker.py)
- `LogActionConfig` with `on_load=["Application startup complete."]`
- `workload_calculator` returns estimated audio seconds (from base64 length or explicit field)
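
The workload calculation is the only nontrivial piece; a sketch, assuming 16 kHz 16-bit mono PCM uploads (the bytes-per-second constant and the field names are assumptions to check against the actual request schema):

```python
BYTES_PER_SECOND = 16_000 * 2  # 16 kHz, 16-bit mono PCM

def estimate_audio_seconds(payload: dict) -> float:
    """Workload estimate for the PyWorker: explicit field wins,
    otherwise infer duration from the base64 payload size."""
    if "audio_seconds" in payload:
        return float(payload["audio_seconds"])
    b64 = payload.get("audio_base64", "")
    raw_bytes = len(b64) * 3 // 4  # base64 -> raw size, ignoring padding
    return max(raw_bytes / BYTES_PER_SECOND, 0.0)
```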

**`onstart.sh`** -- Startup script. Based on [production/onstart.sh](production/onstart.sh):
- Starts model server (`python server.py`) on port 18000 in background, logs to `/var/log/model/server.log`
- Starts PyWorker (`python worker.py`) on port 8000 in background
- Monitors both, exits if either dies

**Update `Dockerfile`** -- Based on [production/Dockerfile](production/Dockerfile):
- Bake model into image (HF download at build time, not runtime -- eliminates 2-3 min cold start)
- Copy triton_asr code instead of production/ code
- Copy nemotron_asr pipeline files (pipelines.py, multi_stream_utils.py, asr_utils.py)
- `EXPOSE 8000 18000`

### Vast Serverless Parameters (from [vast_docs/serverless_params.md](vast_docs/serverless_params.md))

Using the measured sweet spot from Phase 4 (expected ~100-120 concurrent streams per A100):

| Parameter | Value | Rationale |
|:---|:---|:---|
| cold_mult | 2 | Plan 2x current capacity for cold workers |
| min_workers | 2 | Avoid cold start for first users |
| max_workers | 10 | Budget cap |
| min_load | 50 | Keep 50 audio-sec/s capacity minimum |
| target_util | 0.8 | Capacity runs 25% above expected load (1/0.8) for traffic spikes |

### Deployment sequence

1. Build Docker image with model baked in
2. Push to Docker Hub / GHCR
3. Create Vast template with the image + env vars
4. Create Endpoint + WorkerGroup via Vast dashboard or SDK
5. Wait for workers to reach "Ready" (benchmark completes)
6. Run smoke test via `/route/` + direct worker call
7. Run stress test via direct WS to worker IP (discovered via instance API)