---
name: vLLM Inference Server
overview: Build a production vLLM-based inference server for the 350M parallel-head TTS model with KV cache, PagedAttention, continuous batching, INT4 quantization, and streaming audio output.
todos:
  - id: v1
    content: Install vLLM, create /home/ubuntu/kani-tts-inference/ project structure
    status: completed
  - id: v2
    content: Build vLLM-compatible model wrapper with parallel head support
    status: completed
  - id: v3
    content: Build generation engine that extracts cb2/cb3/cb4 alongside main token
    status: completed
  - id: v4
    content: Build NanoCodec decoder with streaming support
    status: completed
  - id: v5
    content: Build FastAPI server with /synthesize and /synthesize/stream endpoints
    status: in_progress
  - id: v6
    content: Run RTF and concurrency benchmarks
    status: pending
isProject: false
---

# vLLM Inference Server for Parallel-Head TTS

## Challenge

vLLM natively supports a single `lm_head` output, but our model has four heads (`lm_head` plus cb2/cb3/cb4). We need to modify vLLM's generation loop to extract the parallel-head predictions alongside the main autoregressive token.

## Approach

Use vLLM's `LLM` class with a custom wrapper that registers our `ParallelHeadLfm2ForCausalLM` as a vLLM-compatible model. Override the model's `forward` to return the `lm_head` logits (for autoregressive sampling) while stashing the parallel-head predictions in a side channel.

## Architecture

```
Client (HTTP) --> FastAPI Server --> vLLM Engine --> ParallelHeadModel
                                        |
                                  PagedAttention
                                  KV Cache
                                  Continuous Batching
                                  CUDA Graphs
                                        |
                                  lm_head --> cb1 token (autoregressive)
                                  cb2_head --> cb2 token (parallel)
                                  cb3_head --> cb3 token (parallel)
                                  cb4_head --> cb4 token (parallel)
                                        |
                                  NanoCodec Decode --> WAV stream
```

## Key Files

All code in `/home/ubuntu/kani-tts-inference/`:

- `model_vllm.py` -- vLLM-compatible model wrapper that registers our parallel head model
- `engine.py` -- Custom generation engine that handles parallel head extraction
- `server.py` -- FastAPI server with streaming audio endpoint
- `codec.py` -- NanoCodec decoder for converting tokens to audio
- `benchmark.py` -- RTF and concurrency benchmarking script
- `requirements.txt` -- Dependencies (vllm, fastapi, uvicorn)

## Steps

### Step 1: Install vLLM and create project structure
- `pip install vllm` and set up the folder

### Step 2: Model wrapper for vLLM
- Register `ParallelHeadLfm2ForCausalLM` with vLLM's model registry
- Override forward to store cb2/cb3/cb4 predictions in a buffer during generation
- Handle the custom token vocabulary (68,442 tokens instead of the standard size)
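The side-channel idea above can be sketched framework-free. All names here are hypothetical; the real wrapper subclasses the vLLM model class and operates on torch tensors, but the pattern is the same: vLLM's sampler sees only the main logits, while cb2/cb3/cb4 outputs are stashed per request for the engine to read back.

```python
class ParallelHeadWrapper:
    """Illustrative sketch of the side-channel pattern (names hypothetical).

    vLLM's generation loop consumes only the returned lm_head logits; the
    cb2/cb3/cb4 predictions are stored per request in `side_channel` and
    read out by the engine after each decode step.
    """

    def __init__(self, base_forward, cb_heads):
        self.base_forward = base_forward  # hidden_state -> lm_head logits
        self.cb_heads = cb_heads          # e.g. {"cb2": fn, "cb3": fn, "cb4": fn}
        self.side_channel = {}            # request_id -> list of per-step preds

    def forward(self, request_id, hidden_state):
        logits = self.base_forward(hidden_state)
        # Record parallel-head outputs without touching the return value,
        # so the sampler sees exactly what a stock model would produce.
        preds = {name: head(hidden_state) for name, head in self.cb_heads.items()}
        self.side_channel.setdefault(request_id, []).append(preds)
        return logits
```

The key design point is that the wrapper's return type is unchanged, so no modification to vLLM's sampling code is needed; only the engine layer knows the side channel exists.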

### Step 3: Generation engine
- Use vLLM's `SamplingParams` for temperature, top_p, etc.
- After each autoregressive step, if the sampled token is an audio token, extract cb2/cb3/cb4 from the parallel heads using the cached hidden state
- Collect all 4 codebook streams
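Collecting the streams reduces to zipping the autoregressive cb1 tokens with the per-step side-channel entries into per-frame codebook tuples, e.g. (sketch; names hypothetical):

```python
def collect_frames(cb1_tokens, side_channel_entries):
    """Zip the autoregressive cb1 stream with the parallel-head predictions
    into one (cb1, cb2, cb3, cb4) tuple per audio frame for the codec."""
    frames = []
    for cb1, preds in zip(cb1_tokens, side_channel_entries):
        frames.append((cb1, preds["cb2"], preds["cb3"], preds["cb4"]))
    return frames
```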

### Step 4: NanoCodec decoder
- Take the 4 codebook streams, decode to waveform
- Support streaming: decode in chunks as frames are generated
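The streaming decode can be sketched as a generator that buffers frames and decodes a chunk as soon as enough have accumulated (`decode_fn` stands in for the NanoCodec decoder; the chunk size is an assumed tunable that trades TTFB against per-call decode overhead):

```python
def stream_decode(frames, decode_fn, chunk_frames=25):
    """Yield decoded audio chunks as soon as `chunk_frames` frames arrive.

    `frames` is an iterable of (cb1, cb2, cb3, cb4) tuples; `decode_fn`
    maps a list of frames to PCM bytes (here, the NanoCodec decoder).
    """
    buf = []
    for frame in frames:
        buf.append(frame)
        if len(buf) >= chunk_frames:
            yield decode_fn(buf)
            buf = []
    if buf:  # flush the final partial chunk
        yield decode_fn(buf)
```

A smaller `chunk_frames` lowers time-to-first-byte but calls the decoder more often; benchmarking in Step 6 should inform the default.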

### Step 5: FastAPI server
- POST `/synthesize` -- text in, WAV file out
- POST `/synthesize/stream` -- text in, chunked audio stream out
- Speaker selection via request body
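For the chunked stream, one workable approach is to emit a WAV header with placeholder sizes (`0xFFFFFFFF`), which most players accept for live streams of unknown length, followed by raw PCM chunks. A stdlib sketch, assuming 22.05 kHz mono 16-bit output (NanoCodec's actual sample rate may differ):

```python
import struct

def wav_stream_header(sample_rate=22050, bits=16, channels=1):
    """Build a 44-byte WAV header with 0xFFFFFFFF placeholder sizes,
    suitable as the first chunk of a streamed response."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return b"".join([
        b"RIFF", struct.pack("<I", 0xFFFFFFFF), b"WAVE",
        b"fmt ", struct.pack("<IHHIIHH", 16, 1, channels,
                             sample_rate, byte_rate, block_align, bits),
        b"data", struct.pack("<I", 0xFFFFFFFF),
    ])
```

In the server this header would be yielded first from the generator backing FastAPI's `StreamingResponse`, followed by the PCM chunks from the decoder.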

### Step 6: Benchmark
- Measure RTF for a single request
- Measure RTF at 10, 50, 100 concurrent requests
- Measure TTFB (time to first audio byte)
- Compare against the current naive inference baseline
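The metrics above are simple ratios; a sketch of the helpers the benchmark script might use (sample rate is an assumed parameter):

```python
def audio_seconds(num_samples, sample_rate):
    """Duration of the generated audio in seconds."""
    return num_samples / sample_rate

def rtf(wall_seconds, num_samples, sample_rate):
    """Real-time factor: generation wall-clock time / audio duration.
    RTF < 1.0 means audio is produced faster than real time."""
    return wall_seconds / audio_seconds(num_samples, sample_rate)

def ttfb(request_sent_t, first_chunk_t):
    """Time to first audio byte, from timestamps on the client side."""
    return first_chunk_t - request_sent_t
```

Under concurrency, per-request RTF is expected to rise while aggregate throughput improves; reporting both makes the comparison with the naive baseline meaningful.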
