
# Quickstart

> Deploy your first vLLM endpoint

## Prerequisites

Before you begin, make sure you have:

<CardGroup cols={3}>
  <Card title="Vast.ai Account" icon="user">
    Sign up at [cloud.vast.ai](https://cloud.vast.ai) and add credits to your account
  </Card>

  <Card title="API Key" icon="key">
    Generate an API key from your [account settings](https://docs.vast.ai/keys)
  </Card>

  <Card title="HuggingFace Token" icon="robot">
    Create a [HuggingFace account](https://huggingface.co) and generate a [read-access token](https://huggingface.co/settings/tokens) for gated models
  </Card>
</CardGroup>

## Configuration

### Install the Vast SDK

Install the SDK that you'll use to interact with your serverless endpoints:

```bash
pip install vastai_sdk
```

<Note>
  The SDK provides an async Python interface for making requests to your endpoints. You'll use this after setting up your infrastructure.
</Note>

### API Key Setup

Set your Vast.ai API key as an environment variable:

```bash
export VAST_API_KEY="your-api-key-here"
```

The SDK will automatically use this environment variable for authentication. Alternatively, you can pass the API key directly when initializing the client:

```python
from vastai import Serverless

client = Serverless(api_key="your-api-key-here")

### HuggingFace Token Setup

Many popular models like Llama and Mistral require authentication to download. Configure your HuggingFace token once at the account level:

1. Navigate to your [Account Settings](https://cloud.vast.ai/account/)
2. Expand the **"Environment Variables"** section
3. Add a new variable:
   * **Key**: `HF_TOKEN`
   * **Value**: Your HuggingFace read-access token
4. Click the **"+"** button, then **"Save Edits"**

<Note>
  This token will be securely available to all your serverless workers. You only need to set it once for your account.
</Note>

<Warning>
  Without a valid HF\_TOKEN, workers will fail to download gated models and remain in "Loading" state indefinitely.
</Warning>
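
Before deploying, you can confirm the token itself is valid by checking it against the HuggingFace Hub. A minimal sketch, assuming the `huggingface_hub` package is installed (its `whoami` call returns the account the token belongs to; the `check_hf_token` helper name is just for illustration):

```python
import os

def check_hf_token() -> str:
    """Validate the HF_TOKEN environment variable against the HuggingFace Hub.

    Returns the account name the token belongs to; raises if the token
    is missing. Assumes the `huggingface_hub` package is installed.
    """
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    # Lazy import so the helper can be defined without the dependency installed
    from huggingface_hub import whoami
    return whoami(token=token)["name"]

# Usage: check_hf_token() returns your account name when the token is valid;
# huggingface_hub raises an error if the Hub rejects the token.
```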

## Deploy Your First Endpoint

<Steps>
  <Step title="Create an Endpoint">
    Navigate to the [Serverless Dashboard](https://cloud.vast.ai/serverless/) and click **"Get Started"**.

    Use these recommended settings for your first deployment:

    | Setting                | Value           | Description                                   |
    | ---------------------- | --------------- | --------------------------------------------- |
    | **Endpoint Name**      | `vLLM-Qwen3-8B` | Choose a descriptive name for your endpoint   |
    | **Cold Multiplier**    | 3               | Scales capacity based on predicted load       |
    | **Minimum Workers**    | 5               | Pre-loaded instances for instant scaling      |
    | **Max Workers**        | 16              | Maximum GPU instances                         |
    | **Minimum Load**       | 1               | Baseline tokens/second instantaneous capacity |
    | **Minimum Cold Load**  | 0               | Baseline tokens/second total capacity         |
    | **Target Utilization** | 0.9             | Resource usage target (90%)                   |

    <img src="https://mintcdn.com/vastai-80aa3a82/FmaSXr63RVn7iFQw/images/serverless_quickstart_create_endpoint.png?fit=max&auto=format&n=FmaSXr63RVn7iFQw&q=85&s=7c3edeb848ad828533bbfe3cdba2924f" alt="Create Endpoint settings form" width="467" height="774" />

    Click **"Next"** to proceed.
  </Step>

  <Step title="Create a Workergroup">
    You will now be taken to the **Create Workergroup** page.

    Select the **vLLM (Serverless)** template, which comes pre-configured with:

    * **Model**: Qwen/Qwen3-8B (8 billion parameter LLM)
    * **Framework**: vLLM for high-performance inference
    * **API**: OpenAI-compatible endpoints

    The template will automatically select appropriate GPUs with enough VRAM for the model.

    <img src="https://mintcdn.com/vastai-80aa3a82/JJmO4K619r2EN0Oj/images/serverless_quickstart_create_workergroup.png?fit=max&auto=format&n=JJmO4K619r2EN0Oj&q=85&s=5883e2deb283d2b05fae8377482e18f4" alt="Create Workergroup page with the vLLM (Serverless) template selected" width="2912" height="1582" />

    Click **"Create"** to proceed with the default settings.
  </Step>

  <Step title="Wait for Workers to Initialize">
    Your serverless infrastructure is now being provisioned. **This process takes time** as workers need to:

    1. Start up the GPU instances
    2. Download the model (8GB for Qwen3-8B)
    3. Load the model into GPU memory
    4. Complete health checks

    <Warning>
      **Expect a 3-5 minute wait** for workers to become ready, especially on first deployment. Larger models may take longer.
    </Warning>

    Monitor the worker status in the dashboard:

    * **Stopped**: Worker has the model loaded and is ready to activate on-demand (cold worker)
    * **Loading**: Worker is starting up and loading the model into GPU memory
    * **Ready**: Worker is active and handling requests

    You can view detailed statistics by clicking **"View detailed stats"** on the Workergroup.

    Monitor the instance logs to track the loading process:

    * Click the "eye" icon to view the logs for a worker
    * Logs show model download progress, loading status, and any startup errors
    * This helps identify issues early rather than waiting for timeouts

    <img src="https://mintcdn.com/vastai-80aa3a82/JJmO4K619r2EN0Oj/images/serverless_quickstart_loading_workers.png?fit=max&auto=format&n=JJmO4K619r2EN0Oj&q=85&s=c380a623c8b62be3c43eea31e196d768" alt="Dashboard showing workers in Loading state" width="2924" height="1500" />

    <Note>
      The SDK automatically holds and retries requests until workers are ready. However, for best performance, wait for at least one worker to show "Ready" or "Stopped" status before making your first call.
    </Note>
  </Step>
</Steps>

## Make Your First API Call

### Basic Usage

With the SDK installed, here's how to make your first API call:

```python
import asyncio
from vastai import Serverless

MAX_TOKENS = 100

async def main():
    # Initialize the client; it reads the VAST_API_KEY environment variable by default
    client = Serverless()

    # Get your endpoint
    endpoint = await client.get_endpoint(name="vLLM-Qwen3-8B")

    # Prepare your request payload
    payload = {
        "model": "Qwen/Qwen3-8B",
        "prompt": "Explain quantum computing in simple terms",
        "max_tokens": MAX_TOKENS,
        "temperature": 0.7,
    }

    # Make the request
    result = await endpoint.request("/v1/completions", payload, cost=MAX_TOKENS)

    # The SDK returns a wrapper object with metadata
    # Access the OpenAI-compatible response via result["response"]
    print(result["response"]["choices"][0]["text"])

    # Clean up
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())
```

<Note>
  The SDK handles all the routing, worker assignment, and authentication automatically. You just need to specify your endpoint name and make requests.
</Note>
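
Because the endpoint exposes OpenAI-compatible routes, the same `endpoint.request` pattern also works for chat-style calls. A sketch, assuming the endpoint name from the deployment above; the `/v1/chat/completions` path and the `messages` payload shape follow the OpenAI chat schema:

```python
import asyncio

MAX_TOKENS = 100

# OpenAI-compatible chat payload: "messages" replaces "prompt"
chat_payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "max_tokens": MAX_TOKENS,
    "temperature": 0.7,
}

async def main():
    from vastai import Serverless  # imported here so the payload above stands alone

    client = Serverless()  # uses the VAST_API_KEY environment variable
    endpoint = await client.get_endpoint(name="vLLM-Qwen3-8B")
    result = await endpoint.request("/v1/chat/completions", chat_payload, cost=MAX_TOKENS)

    # Chat responses return text under message.content rather than "text"
    print(result["response"]["choices"][0]["message"]["content"])
    await client.close()

# To send the request: asyncio.run(main())
```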

## Troubleshooting

<AccordionGroup>
  <Accordion title="Workers stuck in 'Loading' state">
    * Check if the GPU has enough VRAM for your model
    * Verify your model name is correct
    * Check worker logs in the dashboard by clicking on the worker
    * Ensure your HF\_TOKEN is properly configured for gated models
  </Accordion>

  <Accordion title="'No workers available' error">
    * The SDK automatically retries requests until workers are ready
    * If this persists, check endpoint status in the [Serverless Dashboard](https://cloud.vast.ai/serverless/)
    * Verify workers are not stuck in "Loading" state (see troubleshooting above)
  </Accordion>

  <Accordion title="Slow response times">
    * The first request may take longer while workers activate from a cold state
    * Increase `max_workers` if all workers are saturated with requests
    * Increase `min_load` if too few workers are immediately available when requests arrive in parallel
    * For large request spikes, increase `cold_workers` or decrease the target utilization
    * Consider worker region placement relative to your users
  </Accordion>
</AccordionGroup>

***

<Note>
  **Need help?** Join our [Discord community](https://discord.gg/hSuEbSQ4X8) or check the [detailed documentation](/serverless/architecture) for advanced configurations.
</Note>
