---
name: Direct WebSocket Capacity Test
overview: Bypass RunPod's HTTP LB proxy for WebSocket connections by exposing worker ports directly, then run proper per-worker GPU capacity testing with direct WS access.
todos:
  - id: cleanup
    content: Delete old log files and unused test scripts
    status: pending
  - id: server-changes
    content: Add /connect endpoint + fix GPU monitoring in server.py, update Dockerfile port
    status: pending
  - id: docker-build-push
    content: Build v3-direct Docker image and push to GHCR
    status: pending
  - id: template-deploy
    content: Update RunPod template with new image and exposed ports, restart worker
    status: pending
  - id: verify-direct-ws
    content: Verify /connect returns direct WS URL, smoke test direct WS connection
    status: pending
  - id: update-test
    content: Update worker_capacity_test.py to use two-step connect flow
    status: pending
  - id: run-sweep
    content: Run capacity sweep with direct WS, find per-worker saturation point
    status: pending
  - id: analyze-update
    content: Analyze results, derive scaling laws, update README
    status: pending
---

# Direct WebSocket Access + Per-Worker Capacity Test

## Problem

RunPod's LB is a Layer 7 HTTP proxy. It serializes WebSocket upgrades (20-60s per connection) and has a hard 5.5-minute processing timeout that kills long WS sessions. The docs explicitly state: *"Expose HTTP/TCP ports... is required for applications that need persistent connections, such as WebSockets."*

## Architecture Change

```mermaid
sequenceDiagram
    participant Client
    participant LB as RunPod_LB
    participant Worker as GPU_Worker

    Note over Client,Worker: Step 1: Discover worker address (HTTP, through LB - fast)
    Client->>LB: GET /connect
    LB->>Worker: route to healthy worker
    Worker-->>Client: {"ws_url": "wss://POD_ID-8000.proxy.runpod.net/ws"}

    Note over Client,Worker: Step 2: Connect WebSocket directly (bypasses LB)
    Client->>Worker: WSS direct connection (no 5.5min limit)
    Worker-->>Client: Ready
    Client->>Worker: audio chunks (real-time)
    Worker-->>Client: transcript stream
```

The key: RunPod workers are pods. Each exposed port gets a direct proxy URL at `{RUNPOD_POD_ID}-{PORT}.proxy.runpod.net`. This is a simple TCP proxy, not the LB -- no serialization, no 5.5min timeout. The `RUNPOD_POD_ID` env var is injected into every worker automatically.

## Changes Required

### 1. Server changes: [runpod_deploy/server.py](runpod_deploy/server.py)

**Add `/connect` endpoint** (after the existing `/health` endpoint, ~line 272):

```python
@app.get("/connect")
async def connect():
    """Connection broker: returns this worker's direct WS URL.
    Called through LB (fast HTTP), returns direct-access WS address
    that bypasses the LB proxy entirely."""
    pod_id = os.environ.get("RUNPOD_POD_ID", "")
    port = PORT
    if pod_id:
        direct_ws = f"wss://{pod_id}-{port}.proxy.runpod.net/ws"
    else:
        direct_ws = None  # Fallback: not on RunPod
    
    return JSONResponse(content={
        "ws_url": direct_ws,
        "pod_id": pod_id,
        "port": port,
        "active_streams": len(client_queues),
        "max_streams": cfg.streaming.max_active_streams if cfg else 1000,
        # Debug aid (see Risk section below): expose all RUNPOD_* env vars
        # so the correct proxy URL format can be recovered if the assumed
        # one is wrong.
        "debug_env": {k: v for k, v in os.environ.items()
                      if k.startswith("RUNPOD_")},
    })
```

**Fix GPU monitoring** in `/ping` (lines 248-253): replace the silently-swallowed `except Exception` with explicit attribute access, and surface the error message instead of hiding it:

```python
try:
    vram_used = torch.cuda.memory_allocated(0) / 1e9  # bytes -> GB
    vram_total = torch.cuda.get_device_properties(0).total_memory / 1e9
    gpu_name = torch.cuda.get_device_name(0)
except Exception as e:
    vram_used = vram_total = 0
    gpu_name = f"error: {e}"
```

### 2. Dockerfile changes: [runpod_deploy/Dockerfile](runpod_deploy/Dockerfile)

Expose port 8000 (the actual server port set by template env vars):

```dockerfile
ENV PORT=8000
EXPOSE 8000
```

### 3. Template update via GraphQL API

Update template `dnbglsthvp` to:

- Use new Docker image tag `:v3-direct`
- Set `ports: "8000/http"` to expose the WS port for direct access
- Keep env vars: `PORT=8000`, `HF_TOKEN=...`
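The update can be scripted against RunPod's GraphQL endpoint. A hedged sketch: the `saveTemplate` mutation name and `SaveTemplateInput` fields are assumptions to verify against the live schema, and the image path is a placeholder:

```python
import json
import urllib.request

RUNPOD_GQL = "https://api.runpod.io/graphql"

def build_template_mutation(template_id: str, image: str,
                            ports: str, env: dict[str, str]) -> dict:
    """Build a saveTemplate payload (mutation/field names assumed)."""
    query = (
        "mutation ($input: SaveTemplateInput!) {"
        "  saveTemplate(input: $input) { id imageName ports }"
        "}"
    )
    return {
        "query": query,
        "variables": {"input": {
            "id": template_id,
            "imageName": image,
            "ports": ports,
            "env": [{"key": k, "value": v} for k, v in env.items()],
        }},
    }

def update_template(api_key: str, payload: dict) -> bytes:
    """POST the mutation; network call, so run deliberately."""
    req = urllib.request.Request(
        f"{RUNPOD_GQL}?api_key={api_key}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```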

### 4. New capacity test: [worker_capacity_test.py](worker_capacity_test.py) (rewrite core flow)

Replace the entire connection logic with a two-step flow:

- Step 1: `GET /connect` via LB URL -> get `ws_url` (direct WS address)
- Step 2: Connect WebSocket to `ws_url` directly (0.4s, no LB queue)
- Connections now bypass the LB entirely -- should be fast and reliable

The sweep levels, sustained load, health monitoring, and analysis all stay the same.
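The two-step flow above can be sketched as follows (the `pick_ws_url`/`open_stream` names are hypothetical, and the aiohttp/websockets usage is illustrative of the harness, not its final form):

```python
def pick_ws_url(connect_payload: dict, lb_ws_url: str) -> str:
    """Prefer the direct WS URL from /connect; fall back to the LB WS
    endpoint when ws_url is null (worker not running on RunPod)."""
    return connect_payload.get("ws_url") or lb_ws_url

async def open_stream(lb_http_url: str, lb_ws_url: str):
    # Lazily import harness deps so pick_ws_url stays dependency-free.
    import aiohttp
    import websockets

    # Step 1: discover this worker's direct address through the LB (fast HTTP).
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{lb_http_url}/connect") as resp:
            payload = await resp.json()

    # Step 2: dial the WebSocket directly, bypassing the LB proxy.
    return await websockets.connect(pick_ws_url(payload, lb_ws_url))
```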

### 5. Cleanup files to delete

- `cap.log`, `cap5.log`, `capacity_debug.log`, `capacity_final.log`, `capacity_sweep.log`, `capacity_sweep2.log`, `capacity_sweep_final.log` -- old/failed test logs
- `run_sweep.sh` -- temp helper script
- `stress_test_runpod.py` -- superseded by `worker_capacity_test.py`
- `benchmark.py`, `load_test.py`, `test_batched_latency.py`, `test_concurrent_proper.py`, `test_local_concurrent.py`, `test_local_latency.py` -- old local test files no longer needed

Keep: `stress_test.py` (Modal test, different target), `worker_capacity_test.py` (our active test), `transcribe.py` (utility)
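The deletions above as a small housekeeping script (paths assumed relative to the repo root; `cleanup` is a hypothetical helper):

```python
from pathlib import Path

STALE = [
    "cap.log", "cap5.log", "capacity_debug.log", "capacity_final.log",
    "capacity_sweep.log", "capacity_sweep2.log", "capacity_sweep_final.log",
    "run_sweep.sh", "stress_test_runpod.py", "benchmark.py", "load_test.py",
    "test_batched_latency.py", "test_concurrent_proper.py",
    "test_local_concurrent.py", "test_local_latency.py",
]

def cleanup(root: str = ".") -> list[str]:
    """Delete stale files under root; return the names actually removed."""
    removed = []
    for name in STALE:
        path = Path(root) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```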

## Execution Sequence

1. Edit `server.py` + `Dockerfile` (~5 min)
2. Build Docker image `:v3-direct` (~15-20 min)
3. Push to GHCR (~10 min)
4. Update template + restart worker via GraphQL (~3 min)
5. Verify `/connect` returns direct WS URL (~1 min)
6. Quick smoke test: connect WS directly, verify 0.4s connect + transcription works (~2 min)
7. Run capacity sweep: [5, 10, 20, 30, 50, 80, 100, 150, 200] with 60s sustained load (~30 min)
8. Analyze results, find saturation knee point, update README

## Risk: Direct Proxy URL Format Unknown

The `{POD_ID}-{PORT}.proxy.runpod.net` format works for regular pods. For serverless workers, it might differ. Mitigation: the `/connect` endpoint also returns the `pod_id` and all `RUNPOD_*` env vars in a debug field. If the proxy URL doesn't work, we can inspect the actual env to find the correct format -- one rebuild cycle at most.