
Self-Hosting LLMs with vLLM

A practical guide to self-hosting open-source language models using vLLM, covering setup, batching, and serving for production workloads.

By Codeloom · Intermediate

What you'll learn

  • Why vLLM is fast for LLM serving
  • How PagedAttention manages memory
  • How to start a vLLM server locally
  • How to send OpenAI-style requests
  • When self-hosting pays off versus APIs

Prerequisites

  • Familiar with APIs
  • Basic Python and Linux

What and Why

vLLM is an open-source inference engine designed to serve large language models with high throughput and low latency. It was built at UC Berkeley to solve a real bottleneck: conventional serving stacks reserve key-value (KV) cache memory for each request's maximum possible length, so much of the GPU's memory sits unused or fragmented. vLLM introduces PagedAttention, a technique inspired by virtual memory paging in operating systems, which stores the KV cache in small blocks allocated on demand and lets the engine pack many concurrent requests onto the same GPU.

Why self-host at all? Hosted APIs are convenient, but they come with per-token costs, data residency concerns, and rate limits. If you have steady traffic, sensitive data, or want to use a fine-tuned open-weight model like Llama, Mistral, or Qwen, hosting yourself can be cheaper and more flexible. vLLM is one of the most widely used open-source engines for this.

Mental Model

Think of an LLM server as a tiny restaurant. Each request is a customer ordering a meal of variable length. Naive serving cooks one meal at a time. Continuous batching, which vLLM uses, is like a kitchen that constantly swaps new orders in as soon as old ones finish a step, keeping the stove fully busy.

PagedAttention is the pantry organizer. Instead of reserving a whole shelf for each customer just in case, it stores ingredients (the KV cache) in small fixed blocks that can be reused across customers as they leave. This means more concurrent users on the same GPU.
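
To make the pantry analogy concrete, here is a toy block allocator in plain Python. It is only a sketch of the bookkeeping idea, not vLLM's actual implementation: the class name, the block size of 16 tokens, and the pool of 8 blocks are all assumptions chosen for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


class BlockAllocator:
    """Toy pool of fixed-size KV-cache blocks shared by all requests."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.tables = {}   # request id -> list of block ids it owns
        self.lengths = {}  # request id -> number of tokens stored

    def append_token(self, req_id):
        """Store one more token for req_id, taking a new block only when needed."""
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:  # request is new, or its last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return every block of a finished request to the shared pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


pool = BlockAllocator(num_blocks=8)
for _ in range(20):  # request "a" stores 20 tokens, which needs 2 blocks
    pool.append_token("a")
for _ in range(5):   # request "b" stores 5 tokens, which needs 1 block
    pool.append_token("b")
blocks_in_use = pool.num_blocks - len(pool.free)
pool.release("a")    # "a" finishes; its blocks become reusable right away
free_after_release = len(pool.free)
print(blocks_in_use, free_after_release)
```

Notice that request "b" only ever holds one block, even though it could in principle grow much longer. Memory is claimed as tokens arrive rather than reserved up front.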

Hands-on Example

Install vLLM in a fresh environment with CUDA available:

pip install vllm

Start a server that exposes an OpenAI-compatible API (recent vLLM releases also provide a vllm serve command that does the same job):

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000
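
Loading the weights can take a while, so it can help to wait until the server answers before sending real traffic. The helper below is a minimal sketch that polls the /v1/models route of the OpenAI-compatible server using only the standard library. The function name, the timeout values, and the injectable opener parameter are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, timeout_s=300.0, poll_s=2.0,
                     opener=urllib.request.urlopen):
    """Poll GET {base_url}/v1/models until it returns HTTP 200 or time runs out."""
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass  # nothing is listening yet; the model may still be loading
        time.sleep(poll_s)
    return False
```

Calling wait_until_ready("http://localhost:8000") returns True once the server responds, or False if it never comes up within the timeout.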

Once the server has finished loading the model, you can call it like any OpenAI endpoint using the openai Python package (pip install openai):

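from openai import OpenAI

# The client requires a non-empty key; vLLM only checks it if the server
# was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The model name must match the --model value the server was started with.
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Explain PagedAttention briefly."}],
)
print(resp.choices[0].message.content)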
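
If you want to see what an "OpenAI-style request" looks like on the wire, the sketch below builds the same call with only the standard library. It constructs the HTTP request without sending it, so it runs without a server. The helper name build_chat_request and the max_tokens default are assumptions for this example.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, max_tokens=128):
    """Build (but do not send) a POST to the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "Explain PagedAttention briefly.",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       body = json.load(resp)
#       print(body["choices"][0]["message"]["content"])
```

Because the request is plain JSON over HTTP, any language with an HTTP client can talk to the server, not just Python.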

The flow looks like this:


+----------+      +---------------------+      +-----------+
| Client A | ---> |                     | ---> |           |
+----------+      |  vLLM Scheduler     |      |   GPU     |
+----------+      |  (continuous batch) | <--> | KV blocks |
| Client B | ---> |                     |      |           |
+----------+      +---------------------+      +-----------+
                        |
                        v
                OpenAI-compatible API

Figure: vLLM request flow with continuous batching

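At every decoding step the scheduler produces one token for each request currently in the batch, so many concurrent requests advance together. As soon as one request finishes, its KV blocks free up and a waiting request can join the batch at the next step, which keeps GPU idle time low.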
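
The benefit of swapping requests in at every step can be seen in a small simulation. This is a simplified model, not vLLM's scheduler: it counts decode steps only, ignores prompt processing, and assumes every request's output length is known in advance. The function names and the sample lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes, then the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every decode step, finished requests leave and waiting ones join."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free batch slots from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


output_lengths = [2, 8, 3, 8, 2, 3]  # tokens each request will generate
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With these lengths, static batching takes 19 steps because short requests wait for the long one in their batch, while continuous batching takes 13, the minimum possible for 26 tokens with two slots.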

Trade-offs

Self-hosting with vLLM is not a free lunch. You take on operational work: GPU provisioning, driver versions, monitoring, autoscaling, and model updates. A single A100 or H100 has a fixed throughput ceiling, so peak traffic still needs capacity planning.

Hosted APIs win when traffic is bursty, when you want frontier models, or when your team has no infra capacity. vLLM wins when you have steady volume, want fine-tuned open models, or need predictable latency and cost. There can also be a quality gap: open-weight 7B to 70B models are excellent but often trail the best closed models on hard reasoning, so test on your own tasks before committing.
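
A rough way to judge "steady volume" is a break-even calculation: how many tokens per month must you serve before a fixed GPU bill matches what an API would charge? The sketch below does that arithmetic. The prices are hypothetical placeholders, not quotes from any provider, and the model ignores engineering time, redundancy, and idle hours.

```python
def breakeven_tokens_per_month(gpu_cost_per_hour, api_price_per_million_tokens,
                               hours_per_month=730):
    """Monthly token volume at which a fixed GPU bill equals the API bill."""
    monthly_gpu_cost = gpu_cost_per_hour * hours_per_month
    return monthly_gpu_cost / api_price_per_million_tokens * 1_000_000


def required_tokens_per_second(tokens_per_month, hours_per_month=730):
    """Sustained throughput the GPU must deliver to serve that volume."""
    return tokens_per_month / (hours_per_month * 3600)


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
tokens = breakeven_tokens_per_month(gpu_cost_per_hour=2.0,
                                    api_price_per_million_tokens=1.0)
rate = required_tokens_per_second(tokens)
print(f"{tokens:,.0f} tokens/month, {rate:.1f} tokens/s sustained")
```

The second number matters as much as the first: if your GPU cannot sustain that rate around the clock, or your traffic is far below it, the API is likely the cheaper option.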

Memory is the usual constraint. A 7B model in FP16 needs roughly 14 GB of VRAM just for weights, plus KV cache room. Quantization (AWQ, GPTQ, FP8) can shrink this but adds another tuning dimension.
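
The 14 GB figure is simple arithmetic, and the same approach gives a rough KV cache budget. The sketch below assumes a model shape of 32 layers, 8 KV heads, and a head dimension of 128, which should match Mistral-7B but is worth checking against the model's config.json. It also ignores activation memory and the small overhead of quantization scales.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes needed for one token of one request."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


fp16_gb = weight_memory_gb(7e9, 2)     # FP16: 2 bytes per parameter
int4_gb = weight_memory_gb(7e9, 0.5)   # INT4: half a byte per parameter

# Assumed model shape (see the note above); FP16 cache values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Tokens that fit if 10 GB of VRAM is left over for the KV cache.
tokens_in_10_gb = int(10e9 // per_token)

print(fp16_gb, int4_gb, per_token, tokens_in_10_gb)
```

That token budget is shared by every request in flight, so dividing it by your typical prompt-plus-output length gives a first estimate of how many users one GPU can serve concurrently.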

Practical Tips

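  • Start with a quantized model if your GPU is tight. AWQ INT4 often gives near-FP16 quality at roughly a quarter of the weight memory.
  • Set --max-model-len to the actual context you need. A longer limit lets single requests claim more KV blocks, which reduces how many fit at once, and vLLM will refuse to start if the KV cache cannot hold even one sequence of that length.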
  • Use --tensor-parallel-size to shard a big model across multiple GPUs in one node.
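  • Put a small proxy in front for auth, rate limiting, and logging. Beyond a basic --api-key check, vLLM does not do this for you.
  • Measure both throughput (tokens per second across all users) and latency (time to first token and time per output token). They trade off: larger batches raise throughput but slow down each individual request.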
  • Pin the vLLM and CUDA versions in your deployment. Upgrades sometimes change defaults.
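
For the measurement tip, both numbers can be derived from token arrival timestamps. The helper below is a minimal sketch: the function names and sample timestamps are invented for illustration, and in practice you would record the times with time.perf_counter() while reading a streamed response.

```python
def time_to_first_token(request_start, token_times):
    """Seconds between sending the request and receiving its first token."""
    return token_times[0] - request_start


def aggregate_throughput(runs):
    """Tokens per second across all requests, over the whole measurement window.

    runs is a list of (request_start, token_times) pairs in seconds.
    """
    total_tokens = sum(len(times) for _, times in runs)
    window_start = min(start for start, _ in runs)
    window_end = max(times[-1] for _, times in runs)
    return total_tokens / (window_end - window_start)


# Hypothetical timestamps for two overlapping requests.
runs = [
    (0.0, [0.5, 0.6, 0.7, 0.8]),
    (0.1, [0.9, 1.0, 1.1, 1.2, 1.3, 2.0]),
]
ttfts = [time_to_first_token(start, times) for start, times in runs]
throughput = aggregate_throughput(runs)
print(ttfts, throughput)
```

Tracking the two side by side while you raise concurrency shows where extra load stops improving throughput and only adds latency.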

Wrap-up

vLLM turns a raw open-weight model into an OpenAI-compatible serving endpoint with a single command. PagedAttention and continuous batching are the core ideas that make it fast, and the OpenAI-compatible API means you can drop it in behind code you already wrote against hosted models. Self-hosting is worth it when your traffic is predictable, your data is sensitive, or your models are tuned for your domain, and vLLM is a low-friction way to get there.