Self-Hosting LLMs with vLLM
A practical guide to self-hosting open-source language models using vLLM, covering setup, batching, and serving for production workloads.
What you'll learn
- Why vLLM is fast for LLM serving
- How PagedAttention manages memory
- How to start a vLLM server locally
- How to send OpenAI-style requests
- When self-hosting pays off versus APIs
Prerequisites
- Familiarity with APIs
- Basic Python and Linux
What and Why
vLLM is an open-source inference engine designed to serve large language models with high throughput and low latency. It was built at UC Berkeley to solve a real bottleneck: standard transformer inference wastes a lot of GPU memory on padding and on fragmentation of the key-value (KV) cache. vLLM introduces PagedAttention, a technique inspired by virtual memory in operating systems, which lets the engine pack many concurrent requests onto the same GPU efficiently.
Why self-host at all? Hosted APIs are convenient, but they come with per-token costs, data residency concerns, and rate limits. If you have steady traffic, sensitive data, or want to use a fine-tuned open-weight model like Llama, Mistral, or Qwen, hosting yourself can be cheaper and more flexible. vLLM is one of the most widely used engines for this in production.
Mental Model
Think of an LLM server as a tiny restaurant. Each request is a customer ordering a meal of variable length. Naive serving cooks one meal at a time. Continuous batching, which vLLM uses, is like a kitchen that constantly swaps new orders in as soon as old ones finish a step, keeping the stove fully busy.
PagedAttention is the pantry organizer. Instead of reserving a whole shelf for each customer just in case, it stores ingredients (the KV cache) in small fixed blocks that can be reused across customers as they leave. This means more concurrent users on the same GPU.
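To make the pantry analogy concrete, here is a toy sketch of block-based allocation. It is not vLLM's implementation: the `BlockPool` class, its method names, and the block size of 16 tokens are all hypothetical, chosen only to show how fixed-size blocks are handed out on demand and returned when a request finishes.

```python
class BlockPool:
    """Toy KV-cache pool: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size          # tokens per block (illustrative value)
        self.free = list(range(num_blocks))   # ids of unused blocks
        self.tables = {}                      # request id -> list of block ids
        self.lengths = {}                     # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve room for one more token; return False if the pool is full."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:          # last block is full, or none allocated yet
            if not self.free:
                return False
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1
        return True

    def release(self, req_id):
        """Return all of a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


pool = BlockPool(num_blocks=4, block_size=16)
for _ in range(20):
    pool.append_token("a")                # 20 tokens need 2 blocks
for _ in range(5):
    pool.append_token("b")                # 5 tokens need 1 block
blocks_in_use = 4 - len(pool.free)        # 3
pool.release("a")                         # customer "a" leaves
free_after_release = len(pool.free)       # 3
```

The point of the sketch is that a request only ever holds as many blocks as its current length requires, so the memory wasted per request is at most one partially filled block.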
Hands-on Example
Install vLLM in a fresh environment with CUDA available:
pip install vllm
Start a server that exposes an OpenAI-compatible API:
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000
Now you can call it like any OpenAI endpoint:
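from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
# The API key is a placeholder; the server above was started without one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the --model value
    messages=[{"role": "user", "content": "Explain PagedAttention briefly."}],
)
print(resp.choices[0].message.content)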
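If you would rather not depend on the OpenAI SDK, the same call is an ordinary JSON POST. The sketch below only builds the request using the standard library; `build_chat_request` is a hypothetical helper name, and the URL assumes the server started above is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST to the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "Explain PagedAttention briefly.",
)

# To send it against a running server:
#   with urllib.request.urlopen(req) as r:
#       body = json.load(r)
#       print(body["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.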
The flow looks like this:
+----------+      +---------------------+      +-----------+
| Client A | ---> |                     | ---> |           |
+----------+      |   vLLM Scheduler    |      |    GPU    |
+----------+      | (continuous batch)  | <--> | KV blocks |
| Client B | ---> |                     |      |           |
+----------+      +---------------------+      +-----------+
                             |
                             v
                   OpenAI-compatible API
The scheduler interleaves tokens from many concurrent requests. As soon as one request finishes, its KV blocks free up and a new request slides into the batch with no idle GPU time.
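A small counting model shows why refilling the batch matters. This is a simplification, not vLLM's scheduler: it ignores prefill, assumes every decode step costs the same, and uses made-up output lengths. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """Refill free slots from the queue before every decode step."""
    waiting = deque(lengths)
    running = []            # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # Every running request emits one token; finished requests drop out.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2]    # output tokens per request (made-up numbers)
static_steps = static_batching_steps(lengths, batch_size=2)          # 18
continuous_steps = continuous_batching_steps(lengths, batch_size=2)  # 13
```

With two slots and 25 tokens in total, 13 steps is the minimum possible, so in this toy case continuous batching leaves no slot idle while static batching spends 5 extra steps waiting on the longest request in each batch.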
Trade-offs
Self-hosting with vLLM is not a free lunch. You take on operational work: GPU provisioning, driver versions, monitoring, autoscaling, and model updates. A single A100 or H100 has a fixed throughput ceiling, so peak traffic still needs capacity planning.
Hosted APIs win when traffic is bursty, when you want frontier models, or when your team has no infra capacity. vLLM wins when you have steady volume, want fine-tuned open models, or need predictable latency and cost. There is also a quality gap: open-weight 7B to 70B models are excellent but still trail the best closed models on hard reasoning.
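The steady-versus-bursty argument can be turned into a rough break-even calculation. Every number below is a hypothetical placeholder, not a price quote or a benchmark, and the sketch ignores engineering time, redundancy, and idle capacity beyond the utilization figure it computes.

```python
def self_host_cost_per_million(gpu_cost_per_hour, tokens_per_second):
    """Cost of one million tokens on a GPU that is kept fully busy."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_cost_per_hour, tokens_per_second, api_price_per_million):
    """Fraction of the time the GPU must be busy to match the API price."""
    full_load = self_host_cost_per_million(gpu_cost_per_hour, tokens_per_second)
    return full_load / api_price_per_million


# Hypothetical inputs; replace with your own measurements and prices.
gpu_cost = 2.00        # USD per GPU-hour
throughput = 1000.0    # aggregate tokens per second at full load
api_price = 1.00       # USD per million tokens

full_load_cost = self_host_cost_per_million(gpu_cost, throughput)            # about 0.56
utilization_needed = break_even_utilization(gpu_cost, throughput, api_price)  # about 0.56
```

Under these assumed numbers the GPU has to be busy a little over half the time before self-hosting matches the API on raw token cost, which is why bursty traffic tends to favor hosted APIs.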
Memory is the usual constraint. A 7B model in FP16 needs roughly 14 GB of VRAM just for weights (7 billion parameters at 2 bytes each), plus KV cache room. Quantization (AWQ, GPTQ, FP8) can shrink this but adds another tuning dimension.
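The arithmetic behind these figures is easy to reproduce. In the sketch below, the layer count, KV head count, and head dimension are assumptions chosen to resemble a 7B model with grouped-query attention; check your model's config file for the real values. The 4-bit estimate also ignores the small overhead of quantization scales.

```python
def weight_bytes(num_params, bytes_per_param):
    """Memory needed to hold the model weights."""
    return num_params * bytes_per_param


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV cache size for a single token of a single sequence."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


GB = 1e9
fp16_weights_gb = weight_bytes(7e9, 2) / GB      # 14.0
int4_weights_gb = weight_bytes(7e9, 0.5) / GB    # 3.5

# Assumed architecture: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
per_token = kv_bytes_per_token(32, 8, 128, 2)    # 131072 bytes (128 KiB)
kv_gb_8k_sequence = per_token * 8192 / GB        # about 1.07 GB for one 8192-token sequence
```

Under these assumptions, each fully used 8192-token sequence costs about 1 GB of cache on top of the weights, which is what limits how many long requests fit on one GPU at once.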
Practical Tips
- Start with a quantized model if your GPU is tight. AWQ INT4 often gives near-FP16 quality at roughly a quarter of the weight memory.
- Set --max-model-len to the actual context you need. Larger context reserves more KV memory per request.
- Use --tensor-parallel-size to shard a big model across multiple GPUs in one node.
- Put a small proxy in front for auth, rate limiting, and logging. vLLM does not do this for you.
- Measure both throughput (tokens per second across all users) and latency (time to first token). They trade off.
- Pin the vLLM and CUDA versions in your deployment. Upgrades sometimes change defaults.
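For the measurement tip, one simple approach is to record a timestamp for each token as it arrives and summarize afterwards. The helper below is a hypothetical sketch that works on plain timestamps in seconds, so it does not depend on any particular client library; the sample data is synthetic.

```python
def summarize_stream(request_start, token_times):
    """Compute time to first token and decode rate from arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    decode_time = token_times[-1] - token_times[0]
    # The decode rate excludes the first token, which is dominated by prefill.
    if decode_time > 0:
        decode_tps = (len(token_times) - 1) / decode_time
    else:
        decode_tps = float("nan")
    return {"ttft_s": ttft, "total_s": total, "decode_tokens_per_s": decode_tps}


# Synthetic example: first token after 0.5 s, then one token every 50 ms.
times = [0.5 + 0.05 * i for i in range(21)]
stats = summarize_stream(0.0, times)
# stats is approximately {"ttft_s": 0.5, "total_s": 1.5, "decode_tokens_per_s": 20.0}
```

In practice you would fill `token_times` by calling `time.perf_counter()` as each chunk of a streaming response arrives, then aggregate these per-request numbers across concurrent users to see how throughput and latency move against each other.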
Wrap-up
vLLM turns a raw open-weight model into a production-ready serving stack with a single command. PagedAttention and continuous batching are the core ideas that make it fast, and the OpenAI-compatible API means you can drop it in behind code you already wrote against hosted models. Self-hosting is worth it when your traffic is predictable, your data is sensitive, or your models are tuned for your domain, and vLLM is the path of least resistance for that journey.