# GPU Scheduling Strategy for Veena3 TTS

## Overview

This document outlines the GPU scheduling strategy for the Veena3 TTS Modal deployment.

## Current Configuration (Dec 2025)

```python
min_containers=1       # 1 GPU always warm
max_containers=20      # 20 GPUs max (~300 req/s)
buffer_containers=1    # 1 buffer (fast cold boot)
scaledown_window=120   # 2 min idle before release

# With memory snapshots:
enable_memory_snapshot=True  # ~10s cold boot
```
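Wired into a Modal app, these settings would sit on the class decorator roughly as follows. This is a sketch against Modal's current parameter names; the app name, `gpu="L40S"` (taken from the cost table), and the class/method names are assumptions, and the `max_inputs`/`target_inputs` values from the Quick Reference go on `@modal.concurrent`:

```python
import modal

app = modal.App("veena3-tts")

@app.cls(
    gpu="L40S",                   # assumption: GPU type from the cost table
    min_containers=1,             # 1 GPU always warm
    max_containers=20,            # ~300 req/s ceiling
    buffer_containers=1,          # one pre-warmed spare
    scaledown_window=120,         # release after 2 min idle
    enable_memory_snapshot=True,  # ~10s cold boot
)
@modal.concurrent(max_inputs=12, target_inputs=8)  # vLLM batching limits
class Veena3TTS:
    @modal.method()
    def synthesize(self, text: str) -> bytes:
        ...  # TTS inference elided
```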

## Key Performance Metrics

| Metric | Value |
|--------|-------|
| Cold boot (with snapshot) | ~10 seconds |
| Cold boot (without snapshot) | ~50-60 seconds |
| Capacity per container | ~15 req/s sustained |
| GPU cost | $1.95/hr per L40S |
| Concurrent requests per GPU | 8-12 (vLLM batching) |

---

## Scaling Behavior

### Scale-Up (Load Increases)

```
Trigger: per-container queue depth exceeds target_inputs (8)
Action: Start new container
Time to ready: ~10s (memory snapshot)
```

### Scale-Down (Load Decreases)

```
Trigger: container idle for scaledown_window (120s)
Action: Release container
Grace period: 2 minutes
```
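The two triggers above can be sketched as a single reactive policy. This is an illustrative model of the behavior described here, not Modal's actual autoscaler; the inputs `queue_depth` and `idle_seconds` are assumed names:

```python
TARGET_INPUTS = 8        # scale-up trigger from the config
SCALEDOWN_WINDOW = 120   # seconds idle before release

def scaling_decision(queue_depth: int, idle_seconds: int) -> str:
    """Toy model of the reactive scaling policy described above."""
    if queue_depth > TARGET_INPUTS:
        return "scale_up"    # start a new container (~10s to ready)
    if idle_seconds >= SCALEDOWN_WINDOW:
        return "scale_down"  # release the idle container
    return "hold"
```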

---

## Traffic Scenarios

### Scenario A: Gradual Increase (2 → 200 req/s over 10 minutes)

| Time | Traffic | Containers | Notes |
|------|---------|------------|-------|
| T+0 | 2 req/s | 1+1 buffer | Baseline |
| T+2m | 20 req/s | 2 | Buffer consumed, new buffer warming |
| T+4m | 50 req/s | 4 | Gradual scale |
| T+6m | 100 req/s | 7 | Autoscaler catching up |
| T+8m | 150 req/s | 10 | Stable |
| T+10m | 200 req/s | 14 | Peak |

**Result:** Smooth scaling, no dropped requests
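The container counts in the table follow directly from the ~15 req/s sustained capacity per container. A minimal sizing helper (buffer container excluded, matching the table's "active" counts):

```python
import math

REQS_PER_CONTAINER = 15  # sustained capacity per L40S container

def containers_for(req_per_s: float) -> int:
    """Active containers needed at a given request rate (buffer excluded)."""
    return max(1, math.ceil(req_per_s / REQS_PER_CONTAINER))
```

For example, `containers_for(200)` gives the 14 containers shown at T+10m.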

### Scenario B: Sudden Spike (2 → 100 req/s instant)

| Time | Traffic | Containers | Latency Impact |
|------|---------|------------|----------------|
| T+0 | 2 → 100 | 1+1 buffer | Queue builds |
| T+1s | 100 | 2 | Buffer activated |
| T+10s | 100 | 7 | 5 new containers ready |
| T+15s | 100 | 7 | Stable |

**Result:** 10-15s of elevated latency, then stable

### Scenario C: Traffic Drop (200 → 10 req/s)

| Time | Containers | Status |
|------|------------|--------|
| T+0 | 14 active | Load drops |
| T+30s | 14 (13 idle) | Idle countdown |
| T+2m | 14 → 2 | 12 containers released |
| T+3m | 2 (1+1 buffer) | Baseline |

**Result:** 12 GPUs released, saving ~$23/hr

---

## Capacity Planning

| Target Load | Containers | Monthly Cost | Buffer |
|-------------|------------|--------------|--------|
| 15 req/s | 1+1 | ~$2,850 | Good |
| 50 req/s | 4+1 | ~$7,100 | Good |
| 100 req/s | 7+1 | ~$11,400 | Good |
| 200 req/s | 14+1 | ~$21,400 | Tight |
| 300 req/s | 20+1 | ~$30,000 | At max |
| 500+ req/s | raise `max_containers` | - | - |
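The monthly figures above are roughly (active + buffer) containers running continuously at the L40S rate. A sketch of that estimate, assuming a 730-hour month; real spend is lower because idle containers scale down:

```python
GPU_HOURLY = 1.95      # $/hr per L40S
HOURS_PER_MONTH = 730  # average month

def monthly_cost(containers: int, buffer: int = 1) -> float:
    """Upper-bound monthly cost if (containers + buffer) ran 24/7."""
    return (containers + buffer) * GPU_HOURLY * HOURS_PER_MONTH
```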

---

## Why These Settings?

### buffer_containers=1 (not 2)

- **With snapshot:** Cold boot is ~10s
- **User experience:** 10s wait is acceptable for burst
- **Cost:** Saves 1 GPU × $1.95/hr = $1,423/month
- **When to increase:** If cold boot >30s or SLA requires <5s

### scaledown_window=120 (not 180)

- **Response time:** Still 2 minutes of buffer
- **Cost savings:** Faster GPU release during drops
- **Risk:** Slightly more cold boots during oscillating traffic
- **When to increase:** If traffic is very bursty (up/down within seconds)

### max_containers=20 (not 10)

- **Capacity:** Handles 300 req/s (doubled from 150)
- **Cost control:** Still capped, prevents runaway
- **Burst handling:** Can absorb 3x normal peak
- **When to increase:** If hitting 300 req/s regularly

---

## Monitoring Alerts (Recommended)

```yaml
alerts:
  - name: "High Container Utilization"
    condition: active_containers / max_containers > 0.8
    action: "Consider increasing max_containers"
  
  - name: "Frequent Cold Boots"
    condition: cold_boots_per_hour > 10
    action: "Consider increasing buffer_containers"
  
  - name: "High Queue Depth"
    condition: avg_queue_depth > target_inputs
    action: "Autoscaler may be slow, check logs"
```
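The three alert conditions above translate to a simple evaluator. This is a sketch for wiring into whatever metrics pipeline is available; the metric names are taken from the YAML, not from any specific monitoring API:

```python
def check_alerts(active: int, max_containers: int,
                 cold_boots_per_hour: int, avg_queue_depth: float,
                 target_inputs: int = 8) -> list[str]:
    """Return the names of any triggered alerts from the rules above."""
    fired = []
    if active / max_containers > 0.8:
        fired.append("High Container Utilization")
    if cold_boots_per_hour > 10:
        fired.append("Frequent Cold Boots")
    if avg_queue_depth > target_inputs:
        fired.append("High Queue Depth")
    return fired
```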

---

## Extreme Load Handling (10,000 req/s)

If you ever need to handle 10,000 req/s:

1. **Increase max_containers:** ~667 containers needed (10,000 ÷ 15 req/s per container)
2. **Multi-region:** Deploy in multiple regions
3. **Request queuing:** Implement async queue (SQS/Redis)
4. **Rate limiting:** Graceful degradation for overflow
5. **CDN caching:** Cache repeated requests if possible

**Cost at 10,000 req/s:** ~$1,300/hr or ~$950k/month

---

## Decision Tree

```
Is traffic predictable?
├── Yes → Use scheduled scaling (Modal Cron to pre-warm)
└── No → Use reactive autoscaling (current)

Is latency critical (<1s)?
├── Yes → buffer_containers=2, min_containers=2
└── No → buffer_containers=1, min_containers=1

Is cost critical?
├── Yes → scaledown_window=60, buffer_containers=0
└── No → scaledown_window=120, buffer_containers=1

Expected peak load?
├── <100 req/s → max_containers=10
├── <300 req/s → max_containers=20
└── >300 req/s → max_containers=50+ or multi-region
```
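The decision tree can be encoded as a small recommendation function. Note the branches are independent questions and can conflict (latency-critical wants `buffer_containers=2`, cost-critical wants 0); this sketch lets the cost branch win, which is one reasonable resolution, not a rule from the tree itself:

```python
def recommend_settings(predictable: bool, latency_critical: bool,
                       cost_critical: bool, peak_req_s: float) -> dict:
    """Walk the decision tree above and return suggested settings."""
    s = {"scaling": "scheduled pre-warm (Modal Cron)" if predictable
                    else "reactive autoscaling"}
    # Latency branch
    s["buffer_containers"] = 2 if latency_critical else 1
    s["min_containers"] = 2 if latency_critical else 1
    # Cost branch (overrides the buffer choice when both apply)
    if cost_critical:
        s["scaledown_window"], s["buffer_containers"] = 60, 0
    else:
        s["scaledown_window"] = 120
    # Peak-load branch
    if peak_req_s < 100:
        s["max_containers"] = 10
    elif peak_req_s < 300:
        s["max_containers"] = 20
    else:
        s["max_containers"] = "50+ or multi-region"
    return s
```

For example, `recommend_settings(False, False, False, 200)` reproduces the current configuration.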

---

## Quick Reference

| Setting | Value | Why |
|---------|-------|-----|
| `min_containers` | 1 | Always instant response |
| `max_containers` | 20 | ~300 req/s capacity |
| `buffer_containers` | 1 | Fast cold boot (10s) |
| `scaledown_window` | 120 | Balance cost/responsiveness |
| `enable_memory_snapshot` | True | 5x faster cold boot |
| `max_inputs` | 12 | GPU saturation point |
| `target_inputs` | 8 | Scale-up trigger |

Last updated: Dec 26, 2025

