# Veena3 TTS Modal Autoscaling Analysis

**Date**: December 25, 2025  
**Endpoint**: `https://mayaresearch--veena3-tts-ttsservice-serve.modal.run`  
**GPU**: NVIDIA L40S (48GB VRAM)

---

## Executive Summary

Comprehensive stress testing shows that the Veena3 TTS service has **exceptional scaling characteristics**:

| Metric | Value |
|--------|-------|
| **Max Sustainable RPS** | 20.35 requests/second (per container) |
| **Optimal Concurrency** | 32 concurrent requests |
| **Success Rate** | 100% (up to 32 concurrent) |
| **Latency (p50)** | 580-1372ms depending on load |
| **TTFB** | ~450-520ms (warm container) |
| **Cold Start** | ~52s (model loading + snapshot) |
| **RTF (Real-Time Factor)** | 0.19-0.30 (generates audio 3-5x faster than playback) |

**Key Finding**: A single L40S GPU container can handle **20+ requests per second** with a 100% success rate; vLLM's continuous batching handles concurrent load efficiently.
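
For reference, RTF is generation wall-clock time divided by the duration of the audio produced; values below 1.0 mean faster-than-real-time synthesis. A minimal sketch (the numbers are illustrative, not measurements from this report):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio produced.

    RTF < 1.0 means the service synthesizes faster than real time.
    """
    return generation_seconds / audio_seconds

# Illustrative: 1.5s of GPU time to produce 6.0s of audio
rtf = real_time_factor(1.5, 6.0)
print(rtf)      # 0.25
print(1 / rtf)  # 4.0 -> audio is produced 4x faster than playback
```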

---

## GPU Selection Analysis

### Why L40S (Current Choice)

| Aspect | L40S | A100-40GB | A100-80GB | H100 |
|--------|------|-----------|-----------|------|
| **VRAM** | 48GB | 40GB | 80GB | 80GB |
| **FP16 TFLOPs** | ~181 | ~312 | ~312 | ~989 |
| **Modal Price** | ~$1.70/hr | ~$2.78/hr | ~$3.72/hr | ~$4.16/hr |
| **Best For** | Inference | Training/Inference | Large Models | Peak Performance |

**L40S is ideal for Veena3 because:**
1. **48GB VRAM** - Plenty for Spark TTS (~8GB model) + vLLM KV cache + BiCodec + the super-resolution (SR) model
2. **Cost-efficient** - ~40% cheaper than A100-40GB per hour
3. **Inference-optimized** - Ada Lovelace architecture excels at inference
4. **Availability** - Better availability than A100/H100 in Modal

### When to Consider A100 or H100

- **A100-80GB**: If we need to serve multiple models or significantly larger vLLM context
- **H100**: For 2x throughput if cost is no object and latency is critical
- **Multi-GPU**: Not needed unless serving models >48GB

**Recommendation**: Stay with L40S. Our benchmarks show it handles 20+ RPS per container, which is excellent.

---

## Stress Test Results

### Test Configuration

- **Text lengths**: Short (~5 words), Medium (~25 words), Long (~60 words)
- **Concurrency levels**: 1, 2, 4, 6, 8, 10, 12, 16, 20, 32
- **Requests per test**: 5-96 depending on concurrency
- **Format**: WAV (non-streaming, which is more demanding)

### Results by Concurrency Level

| Concurrency | Success | RPS | p50 (ms) | p95 (ms) | TTFB (ms) | RTF |
|-------------|---------|-----|----------|----------|-----------|-----|
| 1 | 100% | 1.14 | 589 | 1819 | 576 | 0.30 |
| 2 | 100% | 3.03 | 580 | 864 | 451 | 0.24 |
| 4 | 100% | 5.75 | 613 | 821 | 468 | 0.24 |
| 6 | 100% | 8.05 | 670 | 865 | 488 | 0.25 |
| 8 | 100% | 9.96 | 634 | 861 | 477 | 0.25 |
| 10 | 100% | 11.58 | 733 | 1349 | 492 | 0.25 |
| 12 | 100% | 14.14 | 686 | 1347 | 487 | 0.25 |
| 16 | 100% | 16.26 | 824 | 1346 | 498 | 0.26 |
| 20 | 100% | 18.43 | 831 | 1485 | 515 | 0.26 |
| **32** | **100%** | **20.35** | 1372 | 1647 | 521 | 0.26 |

### Key Observations

1. **Linear scaling up to 8 concurrent**: RPS increases nearly linearly with concurrency
2. **Diminishing returns after 16**: RPS growth slows but still increases
3. **100% success at 32 concurrent**: vLLM batching handles high load gracefully
4. **TTFB stable**: ~450-520ms regardless of load (excellent)
5. **p95 latency doubles at 32**: Expected queueing behavior, not OOM

### Longer Text Performance

| Text Type | Words | Concurrency | RPS | p50 (ms) | TTFB (ms) | RTF |
|-----------|-------|-------------|-----|----------|-----------|-----|
| Short | ~5 | 8 | 9.96 | 634 | 477 | 0.25 |
| Medium | ~25 | 8 | 4.11 | 1907 | 1511 | 0.19 |
| Long | ~60 | 8 | 1.98 | 3891 | 3436 | 0.19 |

**Insight**: Longer text yields lower RPS but a better RTF - audio generation becomes more efficient per unit of GPU time as sequences grow.

---

## Autoscaling Configuration

### Current Configuration

```python
@app.cls(
    gpu="L40S",
    min_containers=0,        # Scale to zero when idle
    buffer_containers=1,     # Keep 1 warm container ready
    scaledown_window=300,    # Wait 5 min before scaling down
    timeout=600,             # 10 min max per request
    startup_timeout=1200,    # 20 min for model loading
    enable_memory_snapshot=True,
)
@modal.concurrent(max_inputs=8, target_inputs=4)
class TTSService:
    ...
```

### Recommended Configuration (Based on Analysis)

```python
@app.cls(
    gpu="L40S",
    min_containers=0,        # Scale to zero when idle (cost savings)
    buffer_containers=1,     # Keep 1 warm container during active use
    scaledown_window=300,    # 5 min idle before scale-down
    timeout=600,             # 10 min max per request
    startup_timeout=1200,    # 20 min for model loading
    enable_memory_snapshot=True,  # Faster cold starts
)
@modal.concurrent(max_inputs=16, target_inputs=8)  # INCREASED from 8/4
class TTSService:
    ...
```

### Parameter Explanations

#### `min_containers=0`
- **Why**: Cost optimization - no charges when idle
- **Tradeoff**: Cold start penalty (~52s) for first request after idle
- **Alternative**: Set `min_containers=1` for production if cold start is unacceptable

#### `buffer_containers=1`
- **Why**: Keeps 1 extra container ready while Function is active
- **When it helps**: Handles traffic bursts without queueing
- **Cost**: Only active while Function is receiving traffic

#### `scaledown_window=300` (5 minutes)
- **Why**: Prevents rapid scale up/down during bursty traffic
- **Tradeoff**: Slightly higher cost vs faster scale-down
- **Recommendation**: Keep at 5 minutes for typical TTS usage patterns

#### `max_inputs=16` (INCREASED from 8)
- **Why**: Benchmarks show 100% success at 16 concurrent with good latency
- **Effect**: Each container can handle up to 16 simultaneous requests
- **Protection**: Prevents OOM by capping concurrent GPU work

#### `target_inputs=8` (INCREASED from 4)
- **Why**: Autoscaler aims for 8 concurrent per container before scaling
- **Effect**: New containers spin up when existing containers reach 8 concurrent requests
- **Latency**: Keeps p50 latency under 700ms even at target

---

## Scaling Triggers

### When Modal Adds a New Container

1. **Pending inputs > available capacity**: If all containers are at `max_inputs` and more requests arrive
2. **Buffer depletion**: When `buffer_containers` worth of capacity is consumed
3. **Load prediction**: Modal's autoscaler predicts increased demand

### When Modal Removes a Container

1. **No inputs for `scaledown_window`**: Container idle for 5 minutes
2. **Over-provisioned**: More containers than needed for current load
3. **Down to `min_containers`**: Won't go below minimum

### Scaling Example

**Scenario**: 50 concurrent requests arrive

| State | Containers | Requests/Container | Action |
|-------|------------|-------------------|--------|
| Initial | 1 | 0 | Receive traffic |
| T+0ms | 1 | 16 (max) | Queue 34, spin up more |
| T+1s | 2 | 16, 8 | Still scaling |
| T+3s | 4 | 16, 16, 16, 2 | Stable |
| T+5min idle | 1 | 0 | Scale down |
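
The container counts in this walkthrough follow from a simple capacity calculation; a sketch using the recommended `max_inputs=16`:

```python
import math

def containers_needed(concurrent_requests: int, max_inputs: int) -> int:
    """Minimum containers so no request queues, given a per-container cap."""
    return math.ceil(concurrent_requests / max_inputs)

print(containers_needed(50, 16))  # 4 -> matches the stable state above
print(containers_needed(50, 8))   # 7 -> the previous max_inputs=8 needs more
```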

---

## Cold Start Analysis

### Cold Start Breakdown (measured: ~52 seconds)

| Phase | Time | Description |
|-------|------|-------------|
| Container boot | ~1s | Modal's fast container startup |
| Python imports | ~3s | torch, vllm, transformers |
| vLLM engine init | ~15s | Load model weights, compile |
| BiCodec init | ~8s | Load audio tokenizer |
| SR model init | ~5s | Load super-resolution model |
| Warmup | ~5s | First inference to warm GPU |
| **Total (measured)** | **~52s** | First request only (phase times are approximate; the remainder is unattributed startup overhead) |

### Memory Snapshots (Enabled)

With `enable_memory_snapshot=True`:
- **First deployment**: Full cold start (~52s)
- **Subsequent cold starts**: ~15-20s (snapshot restore)
- **Savings**: ~60% reduction in cold start time
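
The savings come from capturing the container's CPU-side state after initialization, so only GPU setup runs on restore. In Modal this split is expressed with `@modal.enter(snap=...)`; a hedged sketch of the pattern (the `TTSService` internals here are placeholders, not the actual implementation):

```python
import modal

app = modal.App("veena3-tts")

@app.cls(gpu="L40S", enable_memory_snapshot=True)
class TTSService:
    @modal.enter(snap=True)
    def load_cpu_side(self):
        # Runs once, then is captured in the memory snapshot:
        # heavy imports and CPU-resident weights restore in seconds.
        ...

    @modal.enter(snap=False)
    def move_to_gpu(self):
        # Runs on every cold start after the snapshot is restored:
        # move weights onto the GPU and initialize the vLLM engine.
        ...
```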

### Cold Start Mitigation Strategies

1. **`min_containers=1`**: Eliminates cold starts for the baseline container (costs ~$1.70/hr even when idle)
2. **`buffer_containers=1`**: Ensures warm container during active use
3. **Memory snapshots**: Already enabled, reduces cold start by ~60%
4. **GPU snapshots (alpha)**: Could further reduce, but experimental

---

## Cost Analysis

### Per-Container Costs (L40S)

| Usage Pattern | Containers | Monthly Cost (Est.) |
|---------------|------------|---------------------|
| Scale to zero, occasional use | 0 (idle) | ~$50-100 |
| 1 always warm | 1 | ~$1,250 |
| Production (min=1, buffer=1) | 1-2 | ~$1,500-2,500 |
| High load (avg 3 containers) | 2-4 | ~$2,500-5,000 |

### Cost per Request

At 20 RPS with 100% utilization:
- **Requests per hour**: 72,000
- **GPU cost per hour**: ~$1.70
- **Cost per 1000 requests**: ~$0.024 (~2.4 cents)
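
The arithmetic, spelled out:

```python
rps = 20                  # sustained requests/second per container (benchmarked)
gpu_cost_per_hour = 1.70  # approximate L40S rate on Modal

requests_per_hour = rps * 3600
cost_per_1k = gpu_cost_per_hour / requests_per_hour * 1000
print(requests_per_hour)      # 72000
print(round(cost_per_1k, 4))  # 0.0236 -> ~2.4 cents per 1000 requests
```

Note this assumes 100% utilization; effective cost per request rises with idle time or buffer capacity.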

---

## Recommendations

### For Development/Testing

```python
min_containers=0       # Save costs
buffer_containers=0    # Accept cold starts
scaledown_window=120   # Faster scale-down
max_inputs=8           # Conservative
target_inputs=4        # Low queueing
```

### For Production (Cost-Optimized)

```python
min_containers=0       # Scale to zero
buffer_containers=1    # 1 warm during use
scaledown_window=300   # 5 min idle
max_inputs=16          # Handle bursts
target_inputs=8        # Balance latency
```

### For Production (Latency-Optimized)

```python
min_containers=1       # Always 1 warm
buffer_containers=2    # Extra capacity
scaledown_window=600   # 10 min idle
max_inputs=12          # Lower for latency
target_inputs=6        # More containers
```

---

## Monitoring

### Key Metrics to Track

1. **Request latency** (p50, p95, p99)
2. **TTFB** (time to first byte)
3. **Container count** (active, buffer, idle)
4. **Queue depth** (pending inputs)
5. **Error rate** (should be 0%)
6. **Cold start frequency**
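
The latency percentiles can be computed from collected samples with the standard library; a minimal sketch (the sample values are illustrative):

```python
from statistics import quantiles

def latency_percentiles(samples_ms: list[float]) -> dict:
    """Return p50/p95/p99 from a list of request latencies in milliseconds."""
    cuts = quantiles(samples_ms, n=100)  # 99 cut points; cuts[i] ~ p(i+1)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Illustrative latencies between 500ms and 699ms
samples = [500.0 + i for i in range(200)]
print(latency_percentiles(samples))
```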

### Alerting Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| p95 latency | > 3s | > 5s |
| Error rate | > 0.1% | > 1% |
| Queue depth | > 50 | > 100 |
| Cold starts/hour | > 5 | > 10 |

---

## Appendix: Test Script

The full stress test script is located at:
```
veena3modal/tests/modal_live/test_scaling_analysis.py
```

Run it with:
```bash
export MODAL_ENDPOINT_URL="https://mayaresearch--veena3-tts-ttsservice-serve.modal.run"
python veena3modal/tests/modal_live/test_scaling_analysis.py
```

Results are saved to `/tmp/scaling_analysis_<timestamp>.json`.

---

## Conclusion

The Veena3 TTS service on Modal L40S demonstrates excellent scaling characteristics:

1. **Single container handles 20+ RPS** with 100% success rate
2. **vLLM continuous batching** works efficiently for concurrent requests
3. **L40S is the right GPU choice** - cost-effective with sufficient VRAM
4. **Increase `max_inputs` to 16** and `target_inputs` to 8 for optimal performance
5. **Memory snapshots** reduce cold start time by ~60%

The recommended configuration balances cost and performance for typical production workloads.





