---
name: vLLM Streaming Monkey-Patch
overview: Monkey-patch vLLM to enable true token-by-token streaming with parallel head predictions via a shared memory side-channel, giving a constant ~65ms TTFB regardless of text length while keeping vLLM's concurrency benefits.
todos:
  - id: mp1
    content: Update model_vllm.py compute_logits to write parallel preds to /dev/shm
    status: pending
  - id: mp2
    content: Create engine_async.py with AsyncLLMEngine wrapper + shared memory reader
    status: pending
  - id: mp3
    content: Add /synthesize/stream/v2 endpoint to server.py with true token-by-token streaming
    status: pending
  - id: mp4
    content: Test TTFB across 2, 5, 10, 20, 35 word texts — verify constant ~65ms
    status: pending
isProject: false
---

# vLLM Streaming with Parallel Heads — Monkey-Patch Plan

## The Problem

vLLM's `LLM.generate()` is batch-only. `AsyncLLMEngine.generate()` CAN stream tokens, but has no `apply_model` — so we can't extract cb2/cb3/cb4 predictions from our parallel heads during streaming. The output pipeline uses strictly-typed `msgspec.Struct` objects serialized over ZMQ between processes, with no extensible fields.

## The Solution: Shared Memory Side-Channel

Instead of modifying vLLM's output pipeline (7+ files), use `/dev/shm` (in-memory tmpfs) as a side-channel between the engine core process (where the model runs) and the API server process (where streaming happens).

```mermaid
sequenceDiagram
    participant API as FastAPI Server
    participant Async as AsyncLLMEngine
    participant Core as EngineCore Process
    participant Model as ParallelHeadModel
    participant SHM as /dev/shm

    API->>SHM: Clear buffer file
    API->>Async: generate(prompt, params)
    
    loop Each decode step
        Core->>Model: forward() + compute_logits()
        Model->>SHM: Write cb2,cb3,cb4 predictions
        Core->>Async: yield new token_id
        Async->>API: yield RequestOutput
        API->>SHM: Read new predictions
        
        alt Every 10 frames
            API->>API: NanoCodec decode chunk
            API-->>API: yield audio PCM bytes
        end
    end
```
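The side-channel mechanics in the diagram are simple enough to sketch standalone: the writer appends one line per frame, and the reader keeps a byte offset so each read consumes only lines added since the last one. A minimal sketch (the demo path and frame-tuple layout are illustrative; it falls back to the temp dir where `/dev/shm` is unavailable):

```python
import os
import tempfile

# Illustrative demo path; falls back to the temp dir on non-Linux systems.
SHM_DIR = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
DEMO_PATH = os.path.join(SHM_DIR, "tts_preds_demo")

def write_preds(preds, path=DEMO_PATH):
    # Writer side (engine core process): append one line per decoded frame.
    with open(path, "a") as f:
        for cb2, cb3, cb4 in preds:
            f.write(f"{cb2},{cb3},{cb4}\n")

def read_new_preds(offset, path=DEMO_PATH):
    # Reader side (API server process): skip bytes already consumed.
    with open(path, "r") as f:
        f.seek(offset)
        data = f.read()
    preds = [tuple(int(x) for x in line.split(","))
             for line in data.splitlines() if line]
    return preds, offset + len(data)

open(DEMO_PATH, "w").close()            # clear, as at request start
write_preds([(1, 2, 3), (4, 5, 6)])
first, offset = read_new_preds(0)       # → [(1, 2, 3), (4, 5, 6)]
write_preds([(7, 8, 9)])
later, offset = read_new_preds(offset)  # → [(7, 8, 9)]
```

Appends of short lines to a tmpfs file are effectively memory writes, which is what keeps the per-step overhead negligible.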



## Files to Change

### 1. `model_vllm.py` — Write predictions to shared memory

In `compute_logits`, instead of appending to `self._parallel_buffer`, write predictions to `/dev/shm/tts_preds`:

```python
def compute_logits(self, hidden_states):
    logits = self.logits_processor(self.lm_head, hidden_states)
    if hidden_states.shape[0] <= 8:  # decode mode
        cb2 = torch.argmax(self.cb2_head(hidden_states), dim=-1).cpu().tolist()
        cb3 = torch.argmax(self.cb3_head(hidden_states), dim=-1).cpu().tolist()
        cb4 = torch.argmax(self.cb4_head(hidden_states), dim=-1).cpu().tolist()
        # Write to shared memory (tmpfs, ~1 microsecond)
        with open("/dev/shm/tts_preds", "a") as f:
            for i in range(len(cb2)):
                f.write(f"{cb2[i]},{cb3[i]},{cb4[i]}\n")
    return logits
```

Keep the existing `_parallel_buffer` approach too (for the sync `LLM` path).

### 2. New `engine_async.py` — AsyncLLMEngine wrapper

- Register model, create `AsyncLLMEngine` (not `LLM`)
- `generate_stream()` method that:
  - Clears `/dev/shm/tts_preds`
  - Calls `engine.generate()` (async generator)
  - For each yielded token, reads new prediction lines from the shared file
  - Yields `(token_id, cb2, cb3, cb4)` tuples
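The reader loop above can be sketched with a stand-in engine so the token/prediction pairing logic runs standalone. vLLM's real `AsyncLLMEngine.generate()` yields cumulative output objects in the same shape, but the exact call signature is simplified here, and `FakeEngine` is purely a mock:

```python
import asyncio
import os
import tempfile
from types import SimpleNamespace

# Falls back to the temp dir where /dev/shm is unavailable (non-Linux).
SHM_PATH = os.path.join(
    "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir(),
    "tts_preds")

class FakeEngine:
    """Stand-in for AsyncLLMEngine: writes one preds line per step, then
    yields a cumulative output object mimicking vLLM's RequestOutput shape."""
    async def generate(self, prompt, params, request_id):
        ids = []
        for step, tok in enumerate([101, 102, 103]):
            # In the real system this write happens inside compute_logits.
            with open(SHM_PATH, "a") as f:
                f.write(f"{step},{step + 1},{step + 2}\n")
            ids.append(tok)
            yield SimpleNamespace(outputs=[SimpleNamespace(token_ids=list(ids))])

async def generate_stream(engine, prompt, params, request_id):
    open(SHM_PATH, "w").close()  # fresh side-channel per request
    offset, seen = 0, 0
    async for output in engine.generate(prompt, params, request_id):
        token_ids = output.outputs[0].token_ids
        new_ids, seen = token_ids[seen:], len(token_ids)
        # Read only the prediction lines written since the last step.
        with open(SHM_PATH, "r") as f:
            f.seek(offset)
            data = f.read()
        offset += len(data)
        preds = [tuple(int(x) for x in line.split(","))
                 for line in data.splitlines() if line]
        for tok, (cb2, cb3, cb4) in zip(new_ids, preds):
            yield tok, cb2, cb3, cb4

async def main():
    return [t async for t in generate_stream(FakeEngine(), "hi", None, "req-1")]

tuples = asyncio.run(main())
# tuples == [(101, 0, 1, 2), (102, 1, 2, 3), (103, 2, 3, 4)]
```

Because `compute_logits` runs before the token is emitted, `zip(new_ids, preds)` pairs each new token with its prediction line in order.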

### 3. Updated `server.py` — True streaming endpoint

New `/synthesize/stream/v2` endpoint:

- Uses the async engine
- Accumulates audio frames as tokens stream in
- Every 10 frames, decodes with NanoCodec and yields audio chunk
- TTFB = prefill (~25ms) + 10 decode steps (~30ms) + chunk decode (~10ms) = **~65ms constant**
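The endpoint's chunking loop can be sketched independently of the HTTP transport: frames accumulate as tokens stream in, and every tenth frame a chunk is decoded and yielded. `decode_chunk` below is a hypothetical stand-in for the NanoCodec decode call:

```python
CHUNK_FRAMES = 10  # frames per audio chunk, per the plan above

def chunk_audio(frames, decode_chunk):
    """Yield a decoded chunk every CHUNK_FRAMES frames, plus the tail."""
    buf = []
    for frame in frames:
        buf.append(frame)
        if len(buf) == CHUNK_FRAMES:
            yield decode_chunk(buf)
            buf = []
    if buf:  # flush whatever remains when the stream ends
        yield decode_chunk(buf)

sizes = list(chunk_audio(range(23), len))  # → [10, 10, 3]
```

Because the first chunk fires after exactly 10 frames, TTFB no longer depends on total text length — only on prefill plus 10 decode steps.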

### Key Details

- **`/dev/shm` performance**: tmpfs (RAM-backed); file I/O takes under a microsecond. No disk involved.
- **Synchronization**: `compute_logits` runs synchronously before the token is emitted. By the time the API server sees a new token, the prediction is already written.
- **Concurrency**: Fine for the single-request voice-agent case. For multiple concurrent requests, tag predictions with the request_id or use per-request files.
- **Cleanup**: Clear the file at the start of each request.
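A per-request file scheme for the multi-request case might look like the following (function names and the sanitization rule are illustrative, not part of the existing code):

```python
import os
import tempfile

# Falls back to the temp dir where /dev/shm is unavailable (non-Linux).
SHM_DIR = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

def preds_path(request_id: str) -> str:
    # Sanitize so arbitrary request ids cannot escape the directory.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in request_id)
    return os.path.join(SHM_DIR, f"tts_preds_{safe}")

def cleanup(request_id: str) -> None:
    # Remove the side-channel file once the request finishes.
    try:
        os.remove(preds_path(request_id))
    except FileNotFoundError:
        pass
```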

## What We Keep

- All existing endpoints (`/synthesize`, `/synthesize/stream`, `/synthesize/stream/fast`) using sync `LLM` — unchanged
- The baked RoPE fix in `model_vllm.py` — unchanged
- vLLM's PagedAttention, KV cache, CUDA graphs — all active

## Expected Results


| Metric          | Current (sync LLM) | After (async streaming) |
| --------------- | ------------------ | ----------------------- |
| TTFB (2 words)  | 68ms               | ~65ms                   |
| TTFB (20 words) | 271ms              | ~65ms                   |
| TTFB (35 words) | 587ms              | ~65ms                   |
| Concurrency     | 200-500            | 200-500                 |
| Audio quality   | Same               | Same                    |
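A harness for todo mp4 could time the first chunk across the listed word counts. `first_chunk` here is a placeholder for whatever callable blocks until the first audio chunk arrives; against the real server it would wrap a streaming HTTP request to `/synthesize/stream/v2` (client code not shown):

```python
import time

def measure_ttfb(first_chunk, texts):
    """Return {text: TTFB in ms}, timing until first_chunk(text) returns."""
    results = {}
    for text in texts:
        t0 = time.perf_counter()
        first_chunk(text)
        results[text] = (time.perf_counter() - t0) * 1000.0
    return results

texts = [" ".join(["word"] * n) for n in (2, 5, 10, 20, 35)]
ttfb = measure_ttfb(lambda text: time.sleep(0.001), texts)  # stub synth
```

Success criterion: the measured TTFBs stay flat (~65ms) across all five text lengths instead of growing with word count.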


