---
name: TTS Pipeline Optimization
overview: "Comprehensive optimization strategy to reduce streaming TTFB from ~336ms to under 100ms at 1 concurrent, maintain sub-500ms TTFB at 50 concurrent, and increase per-GPU streaming throughput from ~5 req/s to 30+ req/s. The plan is structured as 3 tiers: easy wins (hours), medium effort (days), and architectural changes (weeks)."
todos:
  - id: tier1-stop-token
    content: Fix SNAC legacy stop token in streaming pipeline (CODE_END_TOKEN_ID -> TRAINING_STOP_TOKEN_IDS)
    status: completed
  - id: tier1-singleton-parser
    content: Make BiCodecTokenParser a singleton on the pipeline class (save 123ms per TTFB)
    status: completed
  - id: tier1-lower-min-semantic
    content: Lower MIN_SEMANTIC_FOR_FIRST_CHUNK from 16 to 10
    status: completed
  - id: tier1-fix-async
    content: Fix decode_single_async to use run_in_executor instead of blocking
    status: completed
  - id: tier1-remove-redundant-to
    content: Remove redundant .to(device) in BiCodecDecoder.decode()
    status: completed
  - id: tier1-reduce-gpu-mem
    content: Reduce gpu_memory_utilization from 0.85 to 0.25 in VLLM_CONFIG
    status: in_progress
  - id: tier1-verify
    content: Re-run profiler and stress test to measure Tier 1 impact
    status: pending
  - id: tier2-precompute-dvectors
    content: Pre-compute and cache d_vectors for all 12 speakers at startup
    status: pending
  - id: tier2-windowed-decode
    content: Implement windowed BiCodec decode to eliminate O(n) re-decode
    status: pending
  - id: tier2-batch-decode
    content: Implement BiCodec batch decoder for concurrent requests
    status: pending
  - id: tier2-torch-compile
    content: Apply torch.compile to BiCodec WaveGenerator
    status: pending
  - id: tier3-dual-engine
    content: Design dual vLLM engine architecture for prefill/decode separation
    status: pending
isProject: false
---

# TTS Pipeline Optimization: TTFB and Concurrency

## Current State (Profiled on A100-80GB)

The system pairs a 0.5B Qwen2 LLM (~1.3GB) with an ~80M-parameter BiCodec decoder (~160MB), yet consumes 72GB of 80GB VRAM because vLLM pre-allocates 65.6GB of KV cache (enough for 1,399 concurrent sequences, when actual peak demand is ~100). Streaming tops out at ~5 req/s, with TTFB degrading from 336ms at 1 concurrent to 1719ms at 20 concurrent.

```mermaid
flowchart LR
    subgraph currentBottlenecks [Current Bottlenecks]
        A["vLLM Prefill\nContention"] --> B["TTFB grows\nlinearly with\nconcurrency"]
        C["O(n) Full\nRe-Decode"] --> D["GPU wasted\non redundant\nBiCodec calls"]
        E["Wrong Stop\nToken"] --> F["Wasted tokens\nafter EOS"]
        G["Parser Init\n123ms/request"] --> H["Flat TTFB\npenalty"]
        I["65GB KV Cache\nfor 100 users"] --> J["Cannot fit\non smaller GPU"]
    end
```



---

## Tier 1: Easy Wins (implement in hours, ~200ms TTFB reduction)

### 1.1 Fix the wrong stop token (CRITICAL BUG)

The streaming pipeline uses `stop_token_ids=[CODE_END_TOKEN_ID]` where `CODE_END_TOKEN_ID = 128258` is a legacy SNAC constant. The Spark TTS model uses `<|im_end|>` as its stop token. If token ID 128258 doesn't exist in the Spark TTS vocab, the stop condition **never fires** and generation runs until `max_tokens=4096`, wasting hundreds of milliseconds generating garbage tokens.

- File: [veena3modal/core/streaming_pipeline.py](veena3modal/core/streaming_pipeline.py) lines 252-253
- Change: Replace `stop_token_ids=[CODE_END_TOKEN_ID]` with `stop=TRAINING_STOP_TOKEN_IDS` (same as `pipeline.py` line 94)
- Apply to all 3 streaming methods: `generate_speech_stream_indic`, `_first_chunk`, `_continuation`
- Expected impact: Eliminates wasted token generation, saves 100-500ms per request depending on how many tokens are currently wasted
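
As a before/after sketch of the sampling parameters (hedged: the value of `TRAINING_STOP_TOKEN_IDS` is an assumption here -- stop *strings* matching the model's `<|im_end|>` marker, since it is passed via `stop=` rather than `stop_token_ids=`):

```python
# Hedged sketch of the sampling-parameter change. TRAINING_STOP_TOKEN_IDS is
# assumed to hold stop strings matching the model's <|im_end|> marker, as
# used by the non-streaming pipeline.py.
TRAINING_STOP_TOKEN_IDS = ["<|im_end|>"]

legacy_params = {"stop_token_ids": [128258], "max_tokens": 4096}  # may never fire
fixed_params = {"stop": TRAINING_STOP_TOKEN_IDS, "max_tokens": 4096}
```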

### 1.2 Singleton BiCodecTokenParser (save 123ms per TTFB)

A new `BiCodecTokenParser` is created per-request in the streaming pipeline, and `_prewarm_cache()` iterates the full 166K-entry vocabulary each time.

- File: [veena3modal/core/streaming_pipeline.py](veena3modal/core/streaming_pipeline.py) line 282
- Change: Create the parser once in `Veena3SlidingWindowPipeline.__init__()` and store as `self.token_parser`. Each streaming call reuses it (the parser is stateless after init -- the `parse()` method only reads `_cache`).
- Expected impact: -123ms from every streaming TTFB
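
The sharing pattern can be sketched with stand-ins (the real `BiCodecTokenParser` and `Veena3SlidingWindowPipeline` are project-specific; `_DemoParser` and `DemoPipeline` below are hypothetical):

```python
from functools import cached_property

class _DemoParser:
    """Hypothetical stand-in for BiCodecTokenParser: init walks the vocab once."""
    def __init__(self):
        # Simulates _prewarm_cache() iterating the 166K-entry vocabulary.
        self._cache = {i: f"tok_{i}" for i in range(166_000)}
    def parse(self, token_id):
        return self._cache.get(token_id)  # read-only after init, safe to share

class DemoPipeline:
    """Hypothetical stand-in for the pipeline class."""
    @cached_property
    def token_parser(self):
        return _DemoParser()  # built once on first access, reused thereafter

pipe = DemoPipeline()
assert pipe.token_parser is pipe.token_parser  # one instance across all requests
```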

### 1.3 Lower MIN_SEMANTIC_FOR_FIRST_CHUNK from 16 to 10

The BiCodec decoder accepts as few as 8 semantic tokens ([bicodec_decoder.py](veena3modal/core/bicodec_decoder.py) line 242). The pipeline's minimum of 16 adds an unnecessary ~120ms buffer at 50 TPS.

- File: [veena3modal/core/streaming_pipeline.py](veena3modal/core/streaming_pipeline.py) line 264
- Change: Lower to 10 (safe margin over decoder's 8-token floor)
- Expected impact: -120ms from streaming TTFB

### 1.4 Fix decode_single_async blocking the event loop

The `async` method calls BiCodec decode synchronously, blocking ALL concurrent coroutines during GPU compute (~20-50ms per call).

- File: [veena3modal/core/bicodec_decoder.py](veena3modal/core/bicodec_decoder.py) lines 284-309
- Change: Wrap the sync call in `asyncio.get_event_loop().run_in_executor(None, self.decode_streaming, ...)`
- Expected impact: Unblocks concurrent streams during BiCodec decode, improving p95 latency under load
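
A minimal sketch of the wrapper, with a stand-in for the blocking decode call (inside a coroutine, `asyncio.get_running_loop()` is the modern equivalent of `get_event_loop()`):

```python
import asyncio

def _decode_streaming_sync(tokens):
    """Stand-in for the blocking BiCodec GPU call (~20-50ms in the real pipeline)."""
    return [t * 2 for t in tokens]

async def decode_single_async(tokens):
    # Off-load the synchronous decode to the default thread-pool executor so
    # other coroutines keep making progress during the GPU compute.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _decode_streaming_sync, tokens)

audio = asyncio.run(decode_single_async([1, 2, 3]))  # [2, 4, 6]
```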

### 1.5 Remove redundant `.to(device)` on every decode call

Lines 119-120 of [bicodec_decoder.py](veena3modal/core/bicodec_decoder.py) call `.to(self.device)` on every decode; the transfer is redundant after the first call. Perform it once in `__init__` instead.

### 1.6 Reduce gpu_memory_utilization from 0.85 to 0.25

The model weighs ~1.3GB but vLLM pre-allocates 65.6GB for KV cache (enough for 1,399 concurrent sequences). At 0.25, you get:

- KV cache: ~18GB (supports ~380 concurrent sequences, still 4x your peak of 100)
- Freed: ~48GB VRAM
- Enables fitting on L4 (24GB) or even T4 (16GB) GPUs
- File: [veena3modal/core/constants.py](veena3modal/core/constants.py) line 143
- Change: `"gpu_memory_utilization": 0.25`

**Tier 1 combined expected impact: TTFB drops from ~336ms to ~100ms at 1 concurrent.**

---

## Tier 2: Medium Effort (days, 3-5x concurrency improvement)

### 2.1 Pre-compute d_vectors for all 12 speakers (eliminate global pre-roll)

Profiling shows 110ms spent generating the 32 global tokens before ANY audio can stream. But there are only 12 speakers, and the global tokens deterministically encode speaker identity via FSQ quantization.

At startup, for each speaker:

1. Generate one utterance to capture the 32 global tokens
2. Run `speaker_encoder.detokenize(global_tokens)` to get the d_vector (shape: `(1, 1024)`)
3. Cache the d_vector in a dict: `{speaker_name: d_vector_tensor}`

At streaming time:

- Skip global token generation entirely by injecting the cached global tokens into the prompt (using `build_prefix_with_globals`)
- OR better: bypass the LLM's global token generation and directly use the cached d_vector in BiCodec decode

This eliminates the 110ms global token pre-roll phase from streaming TTFB.

- Files: [tts_runtime.py](veena3modal/services/tts_runtime.py), [streaming_pipeline.py](veena3modal/core/streaming_pipeline.py)
- New: Add a `_speaker_cache: Dict[str, Tuple[List[int], torch.Tensor]]` to the runtime
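
A sketch of the startup hook (hypothetical function names; `generate_globals` and `detokenize` stand in for the warm-up generation and `speaker_encoder.detokenize` described above):

```python
def build_speaker_cache(speakers, generate_globals, detokenize):
    """Hypothetical startup hook: speaker name -> (global_tokens, d_vector)."""
    cache = {}
    for name in speakers:
        global_tokens = generate_globals(name)  # 32 tokens from one warm-up utterance
        d_vector = detokenize(global_tokens)    # shape (1, 1024) in the real model
        cache[name] = (global_tokens, d_vector)
    return cache
```

At streaming time, a lookup in this dict replaces the 110ms global-token pre-roll.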

### 2.2 Windowed BiCodec decode (eliminate O(n) re-decode)

Currently, each streaming decode call re-decodes ALL accumulated semantic tokens. For a 10-second utterance (~500 tokens, ~20 decode calls), this means ~12x redundant GPU work.

The BiCodec decoder architecture (quantizer -> prenet -> WaveGenerator) is a 1D convolutional chain with receptive fields defined by the kernel sizes. The WaveGenerator uses `kernel_sizes=[16,11,8,4]` which gives a finite receptive field.

Proposed approach:

- Determine the effective receptive field of the decoder in semantic token units (likely ~32-64 tokens based on kernel sizes and upsampling ratios)
- Use a sliding window of `receptive_field + DECODE_INTERVAL` tokens
- Crossfade the overlap between consecutive windows (already implemented)
- This converts O(n) total decode work to O(1) per chunk

Alternatively, if windowed decode produces artifacts:

- Increase DECODE_INTERVAL from 24 to 48 or 96 (halving/quartering decode calls)
- This is a simple constant change that trades chunk granularity for reduced GPU load
- File: [streaming_pipeline.py](veena3modal/core/streaming_pipeline.py) lines 340-397
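
The windowed scheme can be sketched in toy form (illustration only: it assumes one output sample per semantic token, and a decoder receptive field no larger than `window - interval`, so the newest chunk's samples are unaffected by truncating the history; the real decoder emits many samples per token and crossfades overlaps):

```python
def windowed_decode(tokens, decode_fn, window=64, interval=24):
    """Toy sketch: each step decodes only the trailing `window` tokens
    instead of the full accumulated sequence (O(1) work per chunk)."""
    chunks = []
    for end in range(interval, len(tokens) + 1, interval):
        start = max(0, end - window)          # bounded O(window) work per step
        audio = decode_fn(tokens[start:end])
        chunks.append(audio[-interval:])      # keep only the newest chunk's samples
    return chunks

# With an identity "decoder", windowed output matches a full re-decode:
full = list(range(96))
chunks = windowed_decode(full, lambda ts: list(ts))
assert [s for c in chunks for s in c] == full
```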

### 2.3 BiCodec batch decode across concurrent requests

The BiCodec decoder supports batching (all ops are batch-compatible). Currently each concurrent request decodes independently, serialized by the GIL.

Proposed approach:

- Create a `BiCodecBatchDecoder` that collects decode requests from multiple streams
- Batch them into a single GPU forward pass (pad semantic sequences to equal length)
- Dispatch results back to individual streams

This turns N sequential ~30ms decodes into 1 batched ~40ms decode.

- New file: `veena3modal/core/bicodec_batch_decoder.py`
- Requires: asyncio Queue + background worker pattern
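
A minimal sketch of the queue-and-worker pattern (hypothetical class; `decode_batch_fn` stands in for a single padded GPU forward pass over a list of semantic-token sequences):

```python
import asyncio

class BiCodecBatchDecoder:
    """Hypothetical sketch: coalesce concurrent decode requests into one batch."""

    def __init__(self, decode_batch_fn, max_wait=0.005):
        self._decode_batch = decode_batch_fn
        self._queue = asyncio.Queue()
        self._max_wait = max_wait
        self._worker = None

    async def decode(self, tokens):
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((tokens, fut))
        if self._worker is None or self._worker.done():
            self._worker = asyncio.ensure_future(self._drain())
        return await fut

    async def _drain(self):
        await asyncio.sleep(self._max_wait)  # brief window to collect peer requests
        batch, futs = [], []
        while not self._queue.empty():
            tokens, fut = self._queue.get_nowait()
            batch.append(tokens)
            futs.append(fut)
        for fut, audio in zip(futs, self._decode_batch(batch)):
            fut.set_result(audio)

async def _demo():
    bd = BiCodecBatchDecoder(lambda batch: [[t * 2 for t in seq] for seq in batch])
    return await asyncio.gather(bd.decode([1, 2]), bd.decode([3]))

a, b = asyncio.run(_demo())  # two concurrent requests, one batched forward pass
```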

### 2.4 torch.compile the BiCodec decoder

The BiCodec WaveGenerator (~50M params) runs with dynamic shapes but the global tokens are always exactly 32. Compiling with `torch.compile(mode="reduce-overhead")` can fuse operations and eliminate Python overhead.

```python
# In BiCodecDecoder.__init__:
self.audio_tokenizer.model = torch.compile(
    self.audio_tokenizer.model,
    mode="reduce-overhead",  # Best for small models with frequent calls
    fullgraph=True,
)
```

Expected: 20-40% speedup on BiCodec decode (from ~30ms to ~18-24ms per call).

---

## Tier 3: Architectural Changes (weeks, 10x+ improvement potential)

### 3.1 Dual vLLM engine instances

With gpu_memory_utilization reduced to 0.25, there's ~48GB of free VRAM. Run a **second vLLM engine instance** on the same GPU:

- Engine A: Handles prefill (prompt encoding) -- prioritizes TTFB
- Engine B: Handles decode (token generation) -- prioritizes throughput
- OR: Both handle full requests, doubling effective concurrency

This directly addresses the #1 bottleneck: prefill contention. When 20 streams compete for prefill, currently they queue. With 2 engines, prefill capacity doubles.

Memory budget: 2 x (1.3GB model + 4GB KV cache) = ~10.6GB. Leaves ~65GB free.
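
The "both handle full requests" variant reduces to simple request distribution; a hypothetical round-robin dispatcher (`EnginePool` and `dispatch` are illustrative names, not vLLM API):

```python
import itertools

class EnginePool:
    """Hypothetical round-robin dispatcher over independent engine handles."""

    def __init__(self, engines):
        self._cycle = itertools.cycle(engines)

    def submit(self, request, dispatch):
        # `dispatch` stands in for the real generate call on a vLLM engine;
        # alternating engines halves per-engine prefill queue depth.
        return dispatch(next(self._cycle), request)

pool = EnginePool(["engine_a", "engine_b"])
routed = [pool.submit(i, lambda eng, req: (eng, req)) for i in range(4)]
# routed == [("engine_a", 0), ("engine_b", 1), ("engine_a", 2), ("engine_b", 3)]
```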

### 3.2 Speculative decoding for TTS

TTS token sequences are highly predictable (prosody follows patterns). A tiny draft model (even n-gram) could propose 4-8 tokens per step, with the main model verifying in a single forward pass. This multiplies effective tokens/sec by the acceptance length.

vLLM supports speculative decoding natively via `speculative_model` config.
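
A hedged configuration sketch -- argument names vary across vLLM versions (newer releases group these under a `speculative_config` dict), so treat the exact names and the model path as assumptions:

```python
from vllm import LLM

# Prompt-lookup (n-gram) speculative decoding: no separate draft model needed.
# Argument names are version-dependent assumptions -- check your vLLM release.
llm = LLM(
    model="path/to/spark-tts-llm",   # hypothetical model path
    speculative_model="[ngram]",     # n-gram prompt-lookup drafting
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)
```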

### 3.3 INT8/FP8 BiCodec decoder

Quantize the WaveGenerator, the decoder's largest component at ~50M params. The convolutional stack is generally robust to quantization; INT8 halves memory and typically gives 30-50% speedup on Ampere GPUs. Note that PyTorch's `quantize_dynamic` only supports Linear/RNN-style modules and silently skips `Conv1d`/`ConvTranspose1d`, so the conv layers require static post-training quantization (prepare/calibrate/convert). As a first step, the Linear layers can be quantized dynamically:

```python
import torch
from torch.ao.quantization import quantize_dynamic

# Dynamic quantization covers only Linear/LSTM-style modules; Conv1d and
# ConvTranspose1d are silently left in floating point -- they need static
# PTQ with a calibration pass. Start with the Linear layers:
quantized_decoder = quantize_dynamic(
    self.audio_tokenizer.model.decoder,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
```

### 3.4 Multi-GPU horizontal scaling

With the model fitting in ~3-5GB, a single A100-80GB could theoretically run **8-16 independent TTS instances** using CUDA MPS (Multi-Process Service) or simply separate processes with memory limits. Each instance handles its own request queue.

---

## Priority Matrix

```mermaid
quadrantChart
    title Impact vs Effort
    x-axis Low Effort --> High Effort
    y-axis Low Impact --> High Impact
    quadrant-1 Do First
    quadrant-2 Plan Carefully
    quadrant-3 Quick Wins
    quadrant-4 Deprioritize
    "Fix stop token": [0.1, 0.85]
    "Singleton parser": [0.1, 0.6]
    "Lower first chunk min": [0.05, 0.4]
    "Fix async blocking": [0.15, 0.5]
    "Reduce GPU mem util": [0.1, 0.7]
    "Pre-compute d_vectors": [0.35, 0.75]
    "Windowed decode": [0.5, 0.8]
    "Batch BiCodec": [0.55, 0.6]
    "torch.compile BiCodec": [0.2, 0.35]
    "Dual vLLM engines": [0.7, 0.9]
    "Speculative decode": [0.8, 0.7]
    "INT8 BiCodec": [0.45, 0.3]
```



## Expected Outcome After All Tiers


| Metric                   | Current   | After Tier 1 | After Tier 2 | After Tier 3 |
| ------------------------ | --------- | ------------ | ------------ | ------------ |
| TTFB (1 concurrent)      | 336ms     | ~100ms       | ~50ms        | ~30ms        |
| TTFB (20 concurrent)     | 1719ms    | ~800ms       | ~200ms       | ~100ms       |
| Streaming throughput     | ~5 req/s  | ~8 req/s     | ~20 req/s    | ~50+ req/s   |
| Non-streaming throughput | ~29 req/s | ~35 req/s    | ~45 req/s    | ~80+ req/s   |
| Min GPU required         | 80GB      | 24GB         | 16GB         | 8GB (INT8)   |


