---
name: TTS Pipeline Optimization
overview: "Systematic optimization of the Veena3 TTS pipeline targeting two axes: (1) reduce streaming TTFB from 336ms@1c/1719ms@20c to sub-200ms@1c/sub-500ms@20c, and (2) increase concurrent streams from 5-10 to 50-100+ per A100-80GB GPU."
todos:
  - id: tier1-parser-singleton
    content: "1A: Make BiCodecTokenParser a singleton initialized once at runtime startup, shared across all streaming requests"
    status: completed
  - id: tier1-remove-to-device
    content: "1B: Remove redundant model.to(device) call on every BiCodec decode in bicodec_decoder.py line 119-120"
    status: completed
  - id: tier1-unload-wav2vec
    content: "1C: Unload Wav2Vec2-Large from GPU after BiCodecTokenizer init (or skip loading entirely for decode-only mode)"
    status: completed
  - id: tier1-prune-dead-modules
    content: "1D: Delete unused BiCodec sub-modules (encoder, postnet, mel_transformer) after model load to free ~140MB VRAM"
    status: completed
  - id: tier2-fixed-window-decode
    content: "2A: Replace O(n^2) full-buffer re-decode with fixed-size sliding window BiCodec decode in streaming pipeline"
    status: completed
  - id: tier2-reduce-gpu-mem
    content: "2B: Reduce gpu_memory_utilization from 0.85 to 0.5 (frees ~28GB, still supports 770+ concurrent sequences)"
    status: completed
  - id: tier2-prompt-token-ids
    content: "2C: Pass pre-tokenized prompt_token_ids to vLLM instead of string prompts"
    status: completed
  - id: tier3-bicodec-cpu
    content: "3A: Move BiCodec decode to CPU with async ThreadPoolExecutor execution (frees GPU for vLLM)"
    status: cancelled
  - id: tier3-pregen-globals
    content: "3C: Pre-generate and cache global tokens per speaker at startup, use continuation prompt for all requests"
    status: completed
  - id: tier3-decode-interval
    content: "3D: Reduce DECODE_INTERVAL to 12 and MIN_SEMANTIC_FOR_FIRST_CHUNK to 8"
    status: completed
  - id: reprofile
    content: Re-run profile_pipeline.py and stress_test_local.py after each tier to validate improvements
    status: completed
isProject: false
---

# TTS Pipeline Optimization Plan

## The Problem

A 0.5B-param LLM plus a ~105M-param BiCodec decoder on an 80GB A100 only achieves 5-10 concurrent streams at acceptable latency. Profiling revealed five bottlenecks, diagrammed below, that together consume 95%+ of request time.

## Current Architecture (with profiled bottlenecks)

```mermaid
flowchart TD
    subgraph request [Per Request Flow]
        A[Prompt Build 0.03ms] --> B[vLLM Prefill 12-1559ms]
        B --> C[vLLM Decode 405ms at 296 tok/s]
        C --> D[Token Parse 0.01ms/tok]
        D --> E[BiCodec Decode 18-42ms]
        E --> F[Crossfade 0.1ms]
    end
    
    subgraph bottlenecks [Bottlenecks]
        B1["#1 vLLM Prefill Contention: 12ms at 1c -> 1559ms at 20c"]
        B2["#2 BiCodec O(n^2) Re-decode: full buffer re-decoded every 24 tokens"]
        B3["#3 Parser Init: 123ms per request scanning 166k vocab"]
        B4["#4 1.5GB Dead Weight on GPU: Wav2Vec2 + encoder never used in decode"]
        B5["#5 model.to(device) on every decode call: redundant"]
    end
```



## GPU Memory Audit

Current allocation on A100-80GB:

- vLLM claims 85% = **69.6 GB** (model weights ~1 GB, KV cache ~65.6 GB = room for **1,399 concurrent sequences**)
- BiCodec full model (fp32): ~420 MB (only ~240 MB active during decode)
- Wav2Vec2-Large-XLSR-53: **~1.2 GB -- NEVER USED during decode**
- BiCodec encoder + postnet: **~140 MB -- NEVER USED during decode**

We have 1,399 KV cache slots but can only serve 5-10 streams. The bottleneck is NOT memory -- it is compute scheduling and redundant work.

---

## Optimization Tiers

### TIER 1: Zero-risk, high-impact fixes (no architecture changes)

#### 1A. Singleton BiCodecTokenParser (-123ms TTFB per streaming request)

The `BiCodecTokenParser` is re-created per streaming request in [streaming_pipeline.py](veena3modal/core/streaming_pipeline.py) line 282. Each creation scans all 166,000 vocab entries. The cache is deterministic (same tokenizer = same cache), so it should be a module-level singleton initialized once at startup.

**Change**: Create the parser once in `tts_runtime.py` during `initialize_runtime()` and pass it to the streaming pipeline.

**Impact**: 123ms -> 0ms per streaming request. Pure TTFB reduction.
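As a sketch, the singleton can live at module level with construction injected (`get_parser` and `factory` are illustrative names; the real change builds `BiCodecTokenParser` once inside `initialize_runtime()` and passes it down):

```python
# Module-level singleton: the expensive vocab scan runs exactly once.
_PARSER = None

def get_parser(factory):
    """Return the shared parser, constructing it on first use.

    `factory` is a zero-arg callable, e.g. lambda: BiCodecTokenParser(tokenizer),
    so the ~123ms scan of the 166k vocab is paid once at startup instead of
    once per streaming request.
    """
    global _PARSER
    if _PARSER is None:
        _PARSER = factory()
    return _PARSER
```

Since the cache is deterministic for a given tokenizer, sharing one instance across all concurrent streams is safe as long as the parser itself holds no per-request state.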

#### 1B. Remove redundant `.to(device)` per decode call (removes a per-decode GPU stall)

In [bicodec_decoder.py](veena3modal/core/bicodec_decoder.py) lines 119-120:

```python
self.audio_tokenizer.device = self.device
self.audio_tokenizer.model.to(self.device)  # Called EVERY decode!
```

`model.to()` is a no-op when already on the right device, but it still triggers a device check + parameter iteration on a 105M param model. Remove it -- the model is already on the correct device from init.

#### 1C. Unload Wav2Vec2 from GPU (+1.2 GB free VRAM)

[audio_tokenizer.py](external/sparktts/sparktts/models/audio_tokenizer.py) lines 52-54 load the 315M-param Wav2Vec2-Large model onto the GPU. It is ONLY used by `tokenize()` and `extract_wav2vec2_features()`, which are encoding functions. The TTS decode path (`detokenize()`) never touches it.

**Options (in order of simplicity)**:

- Move Wav2Vec2 to CPU after BiCodecTokenizer init: `self.feature_extractor.cpu()`
- Lazy-load: only load Wav2Vec2 if `tokenize()` is called
- Best: subclass BiCodecTokenizer to skip Wav2Vec2 loading entirely
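A minimal sketch of the first option, assuming the Wav2Vec2 model hangs off the tokenizer as `feature_extractor` (the attribute named above):

```python
def unload_wav2vec2(audio_tokenizer):
    """Park the unused Wav2Vec2 weights in host RAM after init.

    The decode path (detokenize) never touches them, so moving the module
    to CPU frees ~1.2 GB of VRAM with no effect on TTS output.
    """
    audio_tokenizer.feature_extractor = audio_tokenizer.feature_extractor.cpu()
    return audio_tokenizer
```

Follow the move with `torch.cuda.empty_cache()` so the freed blocks leave PyTorch's caching allocator and become visible to vLLM.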

#### 1D. Remove dead BiCodec sub-modules from GPU (+140 MB free VRAM)

After BiCodec loads from checkpoint, call:

```python
del self.audio_tokenizer.model.encoder      # ~23M params, encode-only
del self.audio_tokenizer.model.postnet       # ~12M params, training-only
del self.audio_tokenizer.model.mel_transformer  # MelSpec, encode-only
torch.cuda.empty_cache()
```

Only `quantizer`, `speaker_encoder`, `prenet`, and `decoder` (WaveGenerator) are needed for detokenize.

---

### TIER 2: Algorithmic fixes (major impact, moderate complexity)

#### 2A. Fixed-window BiCodec decode -- eliminate O(n^2) (BIGGEST WIN for streaming)

**Current problem**: Every 24 new semantic tokens, the streaming pipeline calls `detokenize()` with ALL accumulated semantic tokens. For a 10s utterance (~500 tokens), this means 21 decode calls of size 16, 40, 64, ..., 500. Total token-decodes: ~5,376 instead of 500.

**Proposed fix**: Use a fixed-size sliding window for each decode call (like the existing SNAC path already does with 28-token windows):

1. When DECODE_INTERVAL (24) new semantic tokens arrive, decode ONLY a window of the last W tokens (e.g., W=48: 24 new + 24 overlap)
2. The overlap ensures the BiCodec's non-causal ConvNeXt layers have context
3. Extract only the middle portion (the 24 new tokens worth of audio)
4. Crossfade with previous chunk boundary

**Why this works**: BiCodec's prenet uses ConvNeXt blocks (kernel size likely 7), so the receptive field is limited. A window of 48 tokens provides ~24 tokens of context on each side -- sufficient for the convolutions to produce stable output.

**Impact**: Decode time becomes O(n) instead of O(n^2). Each decode processes a fixed ~48 tokens instead of growing to 500+. This directly reduces streaming latency and frees GPU cycles for more concurrent requests.
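The window bookkeeping is pure index math. A sketch, assuming the pipeline tracks the total accumulated token count and the number of new tokens since the last decode (`decode_window_bounds` is a hypothetical helper; `detokenize` and the crossfade step are the existing pipeline functions):

```python
DECODE_INTERVAL = 24
WINDOW = 48  # 24 new + 24 overlap of context for the non-causal ConvNeXt layers

def decode_window_bounds(total_tokens, n_new):
    """Return (start, end, new_offset).

    Slice [start:end] of the semantic buffer is decoded; only the audio
    from token index new_offset (relative to the window) onward is emitted
    as the new chunk, then crossfaded with the previous boundary.
    """
    end = total_tokens
    start = max(0, end - WINDOW)
    new_offset = (end - start) - n_new
    return start, end, new_offset
```

With W=48, each call decodes at most 48 tokens, so a 500-token utterance costs roughly 1,000 total token-decodes (each window carries its overlap) instead of ~5,376.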

#### 2B. Reduce `gpu_memory_utilization` from 0.85 to 0.5

The profiler showed: "Available KV cache memory: 65.61 GiB" = room for 1,399 concurrent sequences at max_model_len=4096. We will never need that many. Each sequence only uses ~48 MB of KV cache.

At `gpu_memory_utilization=0.5`:

- vLLM gets ~40 GB
- KV cache available: ~37 GB (after model weights)
- Max concurrent sequences: ~770 (still 15x more than we need)
- Frees ~28 GB for BiCodec, PyTorch overhead, and future models

**Secondary benefit**: Smaller memory pool means faster memory management, potentially improving scheduling latency.
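A back-of-envelope check of those figures (the ~48 MB/sequence number is implied by the profiler output: 65.61 GiB / 1,399 slots):

```python
GIB = 1024**3
KV_PER_SEQ = 48 * 1024**2  # ~48 MB of KV cache per sequence at max_model_len=4096

baseline_seqs = int(65.61 * GIB) // KV_PER_SEQ  # 1399, matching the profiler
reduced_seqs = (37 * GIB) // KV_PER_SEQ         # ~789, in line with the ~770 above
```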

#### 2C. Pass `prompt_token_ids` instead of string prompts to vLLM

Currently [pipeline.py](veena3modal/core/pipeline.py) line 104 passes `prompt=prompt` (string). vLLM then tokenizes it internally. We already have the tokenizer -- pre-tokenize once and pass token IDs:

```python
prompt_ids = self.model.tokenizer.encode(prompt, add_special_tokens=False)
results_generator = self.model.engine.generate(
    prompt={"prompt_token_ids": prompt_ids},  # Skip internal tokenization
    sampling_params=sampling_params,
    request_id=request_id,
)
```

**Impact**: Saves 1-5ms per request (tokenization overhead), but more importantly avoids tokenizer contention under concurrent load.

---

### TIER 3: Architecture-level optimizations (highest impact, most effort)

#### 3A. Move BiCodec decode to CPU with async execution

**Key insight from profiling**: BiCodec decode is 18-42ms on GPU. The WaveGenerator (HiFi-GAN variant) is Conv1d-heavy and parallelizes well, but it also runs perfectly fine on CPU.

**Proposal**: 

1. Load BiCodec model on CPU instead of GPU
2. Run BiCodec decode in a `ThreadPoolExecutor` to avoid blocking the async event loop
3. This completely eliminates GPU contention between vLLM and BiCodec

**Expected CPU decode time**: 80-200ms per call (3-5x slower than GPU). BUT: with fixed-window decode (Tier 2A), each call processes only ~48 tokens, making CPU decode ~50-100ms.

**Trade-off**: Slightly higher per-decode latency, but vLLM gets 100% of GPU cycles. With 20 concurrent streams, vLLM's token generation rate improves because it no longer shares GPU with BiCodec decode calls. Net effect on TTFB should be positive.
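A sketch of the async wiring, assuming the decoder has been loaded on CPU (`decode_async` and the pool are hypothetical names; `detokenize` stands in for `bicodec.detokenize`):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool so CPU-side vocoder work never blocks the event loop that
# drives vLLM's token streams; size it to the cores budgeted for decode.
_decode_pool = ThreadPoolExecutor(max_workers=4)

async def decode_async(detokenize, semantic_tokens, global_tokens):
    """Run a CPU-resident BiCodec decode off the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _decode_pool, detokenize, semantic_tokens, global_tokens
    )
```

In practice, also cap PyTorch's intra-op threads (`torch.set_num_threads`) so the decode workers do not oversubscribe the cores that vLLM's CPU-side scheduling needs.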

#### 3B. BiCodec batched decode for concurrent streams

When multiple concurrent streams trigger decode at the same time, batch them into a single GPU forward pass:

```python
# Instead of N separate calls:
for req in concurrent_requests:
    audio = bicodec.detokenize(req.semantic, req.global_tokens)  # Sequential

# Batch them (pad_and_stack is a padding helper to be written):
batched_semantic = pad_and_stack([r.semantic for r in concurrent_requests])
batched_global = pad_and_stack([r.global_tokens for r in concurrent_requests])
batched_audio = bicodec.detokenize(batched_semantic, batched_global)  # One GPU call
```

The BiCodec model already supports batch dimensions. The WaveGenerator is purely convolutional and benefits from batching. This amortizes GPU kernel launch overhead across N decodes.

**Impact**: N concurrent decodes of 20ms each = 20ms batched (not N*20ms sequential).

#### 3C. Pre-generate global tokens per speaker at startup

Global tokens encode speaker identity. While they are not purely deterministic (they depend on text/prosody), we can generate a "reference" set of 32 global tokens per speaker at startup and use them as pre-fill for ALL requests to that speaker. This is exactly what `build_prefix_with_globals()` already does for chunk continuation.

**Proposal**:

1. At startup, for each of the 12 speakers, generate one reference utterance
2. Cache the 32 global tokens per speaker
3. For all streaming requests, use `build_prefix_with_globals()` instead of `build_prefix()`
4. The model skips global token generation (saves ~100ms at 1c, much more under concurrency) and goes straight to semantic tokens

**Trade-off**: Slight voice quality variation (global tokens won't be text-specific). But the continuation path already does this for chunks 2+ and it works well. The user can opt out with a flag.

**Impact**: Eliminates the global token generation phase entirely. Streaming TTFB drops by ~110ms at 1c, and under concurrency the effect is much larger because 32 fewer tokens per request means less vLLM scheduling contention.
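A sketch of the cache (the speaker list and `generate_reference_globals` are hypothetical; `build_prefix_with_globals` / `build_prefix` are the existing helpers, passed in here only to keep the sketch self-contained):

```python
GLOBAL_TOKEN_CACHE = {}

def warm_global_tokens(speakers, generate_reference_globals):
    """At startup: one reference generation per speaker, 32 global tokens cached."""
    for speaker in speakers:
        GLOBAL_TOKEN_CACHE[speaker] = generate_reference_globals(speaker)

def build_streaming_prefix(speaker, text, build_prefix_with_globals, build_prefix,
                           use_cached_globals=True):
    """Use the cached globals when available (and not opted out); else fall
    back to the full prompt that generates globals per request."""
    cached = GLOBAL_TOKEN_CACHE.get(speaker) if use_cached_globals else None
    if cached is not None:
        return build_prefix_with_globals(text, cached)
    return build_prefix(text)
```

The `use_cached_globals` flag is the opt-out mentioned above for callers who want text-specific global tokens.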

#### 3D. Reduce DECODE_INTERVAL and MIN_SEMANTIC_FOR_FIRST_CHUNK

Currently: first chunk at 16 semantic tokens, subsequent chunks every 24 tokens. With fixed-window decode (2A), each decode is cheap (~constant time).

**Proposal**: Reduce to `MIN_SEMANTIC_FOR_FIRST_CHUNK=8` and `DECODE_INTERVAL=12`:

- First audio at 8 tokens = ~160ms of audio, decoded in ~15ms
- Subsequent chunks every 12 tokens = ~240ms per chunk
- More frequent, smaller chunks = smoother streaming and lower perceived TTFB
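The chunk sizes above follow from the implied semantic token rate (8 tokens = ~160ms of audio, i.e. ~20ms of audio per token):

```python
MS_PER_TOKEN = 20  # ~50 semantic tokens per second of audio

first_chunk_ms = 8 * MS_PER_TOKEN   # 160 ms of audio before the first emit
interval_ms = 12 * MS_PER_TOKEN     # 240 ms per subsequent chunk
old_first_ms = 16 * MS_PER_TOKEN    # 320 ms under the current setting
```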

---

### TIER 4: Advanced / longer-term

#### 4A. TensorRT for BiCodec decode path

The detokenize path is purely feedforward (Conv1d + ConvTranspose1d + activations). Export to ONNX, optimize with TensorRT:

- Fuse Conv1d + activation ops
- FP16/INT8 quantization of the vocoder
- Expected 2-4x speedup on the decode step

#### 4B. Explicit `max_num_seqs` in vLLM config

vLLM chooses a default `max_num_seqs` that is unrelated to our target concurrency. Set it explicitly to the target (e.g., 128) so the scheduler sizes its batches and budgets resources accordingly:

```python
VLLM_CONFIG["max_num_seqs"] = 128
```

#### 4C. CUDA Stream separation

Run BiCodec decode on a separate CUDA stream from vLLM. This allows true GPU parallelism -- vLLM decode steps and BiCodec vocoder run simultaneously on different SMs.

---

## Expected Impact Summary


| Optimization               | TTFB at 1c | TTFB at 20c    | Max Concurrent Streams |
| -------------------------- | ---------- | -------------- | ---------------------- |
| **Baseline**               | 336ms      | 1719ms         | ~5-10                  |
| + 1A Parser singleton      | -123ms     | -123ms         | same                   |
| + 1B Remove .to(device)    | -2ms       | -2ms           | same                   |
| + 2A Fixed-window decode   | -50ms      | -200ms         | +20%                   |
| + 2C Token IDs to vLLM     | -5ms       | -20ms          | +5%                    |
| + 3A BiCodec on CPU        | +30ms      | -500ms         | +100%                  |
| + 3C Pre-gen globals       | -110ms     | -800ms         | +50%                   |
| + 3D Lower decode interval | -80ms      | -80ms          | same                   |
| + 2B Lower GPU mem util    | 0          | -100ms         | +30%                   |
| **Combined estimate**      | **~100ms** | **~300-400ms** | **40-80 streams**      |


## Implementation Order

Tier 1 fixes first (1A, 1B, 1C, 1D) -- zero risk, immediate gains. Then 2A (biggest algorithmic win), then 3C + 3A together (biggest architectural win for concurrency). Re-profile after each tier to validate.