Running models on your own hardware stopped being a hobbyist curiosity somewhere around 2024 and is now a mainstream engineering choice. Privacy rules, per-token cost, latency, and offline requirements all push teams toward local inference. But "local AI" spans three very different tools (Ollama, vLLM, and Docker Model Runner), and picking the wrong one wastes weeks. This guide compares them honestly: what each is for, where each falls down, and a decision framework you can apply on Monday.
What you will decide by the end
- Whether local inference is even the right call for your workload
- Ollama for laptops and quick prototyping
- vLLM for high-throughput production serving on GPUs
- Docker Model Runner for container-native, OCI-packaged models
- How to size hardware and avoid the classic VRAM trap
Info
Why local at all?
Three honest reasons: data that legally cannot leave your network, predictable cost at high volume, and low/offline latency. If none of those apply, a hosted API is usually cheaper and less work. Be honest about which camp you are in before investing in GPUs.
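The "predictable cost at high volume" argument comes down to a break-even calculation. The sketch below shows the arithmetic only: both prices are hypothetical placeholders, not quotes from any provider, and the function names are made up for illustration. It also ignores engineering time, power, and idle capacity, all of which push the real break-even point higher.

```python
def monthly_cost_hosted(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Cost of a pay-per-token hosted API for a given monthly volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(gpu_usd_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume above which a fixed GPU bill is cheaper than the API."""
    return gpu_usd_per_month / usd_per_million_tokens * 1_000_000


# Hypothetical placeholder prices: substitute your own quotes.
GPU_USD_PER_MONTH = 2_000.0
API_USD_PER_MILLION_TOKENS = 1.0

threshold = break_even_tokens(GPU_USD_PER_MONTH, API_USD_PER_MILLION_TOKENS)
print(f"Break-even at about {threshold / 1e9:.1f}B tokens per month")
```

If your expected volume sits well below the threshold and neither privacy nor offline operation applies, the hosted API is probably the simpler choice.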
The three tools at a glance
They are not really competitors so much as tools for different jobs. Confusing them is the root of most bad local-AI decisions.
Ollama ▸ Single-machine, dev-friendly, "brew install and go"
Best for: laptops, prototypes, single-user tools
vLLM ▸ GPU inference server built for throughput & concurrency
Best for: production serving, many concurrent users
Docker Model Runner ▸ Run models as OCI artifacts in your container workflow
Best for: teams standardised on Docker/Kubernetes
| Dimension | Ollama | vLLM | Docker Model Runner |
|---|---|---|---|
| Best for | Laptops & prototyping | High-throughput serving | Container-native teams |
| Concurrency | Low (single user) | High (continuous batching) | Depends on runtime |
| Setup effort | Minimal | Moderate (GPU ops) | Low if already on Docker |
| Hardware | Laptop / single GPU | Real GPUs | Wherever Docker runs |
| API | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible, plus Docker CLI / OCI |
Ollama: the developer laptop default
Ollama is the easiest on-ramp to local models. One install, one command, and you have a chat-capable model and an OpenAI-compatible endpoint on localhost. It handles model downloads, quantisation formats, and memory management for you, and runs happily on a MacBook's unified memory or a modest GPU.
# Install, pull a model, and chat — three lines.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.3
ollama run llama3.3
# It also serves an OpenAI-compatible API on localhost:11434
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.3","messages":[{"role":"user","content":"hi"}]}'
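The same request can be made from application code. Here is a minimal sketch using only the Python standard library. The helper names `build_chat_request` and `chat` are illustrative, and the vLLM URL assumes its server is running on its default port 8000. Because both servers accept the OpenAI chat-completions request shape, switching between them is just a change of base URL.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # matches the curl example above
VLLM_BASE_URL = "http://localhost:8000/v1"     # assumption: vLLM on its default port


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With Ollama running, `chat(OLLAMA_BASE_URL, "llama3.3", "hi")` should return the model's reply. Nothing is sent until `chat` is called, so the request builder can be unit-tested offline.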
✓ Pros
- Trivial setup; great developer experience
- OpenAI-compatible API means easy app integration
- Excellent on Apple Silicon and single GPUs
- Huge library of ready-to-pull quantised models
✕ Cons
- Built for one machine, not high-concurrency serving
- Throughput collapses under many simultaneous requests
- Less control over batching and GPU memory tuning
- Not what you want behind a busy production endpoint
Tip
Use Ollama to prototype, not to scale
It is the perfect tool to validate that a local model is good enough for your task. Just do not mistake "works great for me on my laptop" for "ready to serve a thousand users" — that is vLLM's job.
vLLM: production throughput
vLLM is a serving engine engineered for one thing: getting the most tokens per second out of your GPUs while handling many concurrent requests. Its headline techniques — continuous batching and paged attention (PagedAttention) — let it pack far more simultaneous requests onto a GPU than naive serving, which is exactly what you need behind a real API.
pip install vllm
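# Serve a model with an OpenAI-compatible endpoint, tuned for throughput.
# Assumes two GPUs whose combined memory fits the 70B weights (roughly 140 GB at 16-bit) plus KV cache;
# otherwise raise --tensor-parallel-size or use a quantised checkpoint.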
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
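To make the paged-attention idea concrete, here is a toy accounting sketch. It is not vLLM's implementation. It compares reserving a full `--max-model-len` region of KV cache for each request with allocating fixed-size blocks only as tokens arrive. The request lengths and the block size of 16 tokens are illustrative assumptions.

```python
import math


def contiguous_reserved_tokens(request_lengths: list[int], max_model_len: int) -> int:
    """Naive scheme: reserve a full max-length KV region for every request."""
    return len(request_lengths) * max_model_len


def paged_reserved_tokens(request_lengths: list[int], block_size: int) -> int:
    """Paged scheme: allocate fixed-size blocks on demand, rounding each request up."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)


lengths = [300, 1200, 50, 4000]  # hypothetical tokens currently held per request
MAX_MODEL_LEN = 8192             # matches --max-model-len in the command above
BLOCK_SIZE = 16                  # assumption: tokens per KV block

naive = contiguous_reserved_tokens(lengths, MAX_MODEL_LEN)
paged = paged_reserved_tokens(lengths, BLOCK_SIZE)
print(f"tokens actually in use: {sum(lengths)}")
print(f"reserved (contiguous):  {naive}")
print(f"reserved (paged):       {paged}")
```

In this toy case the paged scheme reserves only slightly more than the tokens in use, whereas up-front reservation holds several times as much. The memory freed that way is what lets more requests share the same GPU.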
✓ Pros
- State-of-the-art throughput via continuous batching
- Efficient KV-cache memory use with paged attention
- Scales across multiple GPUs (tensor parallelism)
- OpenAI-compatible server drops into existing clients
✕ Cons
- Needs real GPUs — not a laptop tool
- More configuration and ops knowledge required
- Overkill for single-user or low-volume use
- You own capacity planning, autoscaling, and monitoring
Continuous batching delivers higher concurrent throughput than one-request-at-a-time serving.
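A small simulation shows where the gain from continuous batching comes from. This is a deliberately simplified model, not vLLM's scheduler: each decode step produces one token for every running request, prefill cost is ignored, and the output lengths are made up. Static batching waits for the longest request in each batch to finish, while continuous batching refills a slot as soon as it frees up.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Refill any free slot from the waiting queue before every decode step."""
    queue = deque(output_lengths)
    active: list[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each running request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 8, 2, 8, 2, 2]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With these inputs the static scheme needs 18 decode steps and the continuous one needs 12, because short requests no longer hold a slot idle while a long neighbour finishes.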
Docker Model Runner: container-native models
Docker Model Runner treats models as first-class OCI artifacts — you pull and run a model much like a container image, and it integrates with the Docker tooling and Compose workflows your team already uses. For organisations standardised on containers, this means models live in the same registry, versioning, and deployment pipeline as everything else, which is a real operational win.
# Models as OCI artifacts, managed through the Docker workflow.
docker model pull ai/llama3.3
docker model run ai/llama3.3 "Explain paged attention in one sentence."
# List the models you have pulled locally.
docker model list
✓ Pros
- Models versioned and distributed like container images
- Fits naturally into existing Docker/Compose/registry workflows
- Consistent local-to-CI-to-prod packaging
- Lowers the "where does the model live" operational question
✕ Cons
- Youngest of the three — ecosystem still maturing
- Best fit only if you are already all-in on Docker
- Serving performance depends on the underlying runtime
- Fewer tuning knobs than a dedicated engine like vLLM
The decision framework
Strip it down to the question each tool answers best:
Prototyping or single-user on a laptop?
Use Ollama. You will be productive in minutes and can swap models freely.
Serving many concurrent users on GPUs?
Use vLLM. Throughput and memory efficiency are its entire reason to exist.
Standardised on containers and want models in the same pipeline?
Use Docker Model Runner so models follow the same build, registry, and deploy flow as your services.
Not sure local is worth it at all?
Prototype with Ollama, measure quality and cost, and only commit to GPU serving once a hosted API is demonstrably the wrong trade-off.
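The four questions above can be written as a small function, which is a handy way to record the decision in a design doc. This is one possible encoding, not a definitive rule: the field names and the order in which the checks run are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_local: bool             # privacy, high-volume cost, or offline/latency applies
    many_concurrent_users: bool   # a real API with simultaneous requests
    has_gpus: bool                # GPUs are available for serving
    container_standardised: bool  # team already ships everything through Docker


def recommend(workload: Workload) -> str:
    """Map a workload description to the tool the framework points to."""
    if not workload.needs_local:
        return "hosted API (prototype with Ollama first if unsure)"
    if workload.many_concurrent_users and workload.has_gpus:
        return "vLLM"
    if workload.container_standardised:
        return "Docker Model Runner"
    return "Ollama"


laptop_prototype = Workload(True, False, False, False)
print(recommend(laptop_prototype))
```

Concurrency is checked before the container question on the assumption that throughput is the harder constraint to fix later. Reorder the checks if your priorities differ.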
Sizing hardware without tears
The single most common local-AI mistake is underestimating VRAM. A rough rule: the weights take about the parameter count times the bytes per parameter (2 bytes at 16-bit, about 0.5 at 4-bit), plus headroom for the KV cache, which grows with context length and concurrency. A quantised (4-bit) model is far more forgiving than full precision, which is why quantisation is the norm for local serving.
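The rule of thumb can be turned into a back-of-envelope estimator. The sketch below rests on stated assumptions: the model shape is a hypothetical 8B-class configuration with grouped-query attention, the KV cache is stored at 16-bit, and activation memory and runtime overhead are ignored. Treat the result as a lower bound, not a purchasing spec.

```python
from dataclasses import dataclass

GIB = 2**30


@dataclass
class ModelShape:
    params_billion: float
    num_layers: int
    num_kv_heads: int
    head_dim: int


def weight_gib(shape: ModelShape, bits_per_param: int) -> float:
    """Memory for the weights alone at the given precision."""
    return shape.params_billion * 1e9 * bits_per_param / 8 / GIB


def kv_cache_gib(shape: ModelShape, context_tokens: int,
                 concurrent_requests: int, kv_bytes: int = 2) -> float:
    """KV cache size: one key and one value vector per layer, per token, per request."""
    bytes_per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * kv_bytes
    return bytes_per_token * context_tokens * concurrent_requests / GIB


# Hypothetical 8B-class shape; read the real values from your model's config.
shape = ModelShape(params_billion=8.0, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"weights at 16-bit: {weight_gib(shape, 16):.1f} GiB")
print(f"weights at 4-bit:  {weight_gib(shape, 4):.1f} GiB")
print(f"KV cache, 8192 tokens x 20 users: {kv_cache_gib(shape, 8192, 20):.1f} GiB")
```

Under these assumptions the KV cache for twenty users at full context comes to 20 GiB, several times the roughly 3.7 GiB of 4-bit weights. That is why sizing for the weights alone goes wrong.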
Warning
The VRAM trap
A "7B model" sounds small until you realise full-precision weights plus a long-context KV cache for several concurrent users can blow past consumer GPU memory. Plan for the KV cache, not just the weights — it is what actually exhausts your VRAM under load. Quantise, and measure at your real context length and concurrency.
Pro tip
Always benchmark with your actual prompts, context lengths, and concurrency — not a vendor's hero number. Tokens-per-second on a 50-token prompt tells you nothing about how the system behaves at 6,000 tokens with twenty users. Measure the workload you will actually run.
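A benchmark of that kind does not need much machinery. The harness below is a minimal sketch: `run_benchmark` and `fake_send` are illustrative names, and `fake_send` is a stand-in that sleeps instead of calling a server. To measure a real system, replace it with a function that posts one of your real prompts to your endpoint and returns the number of tokens generated.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_benchmark(send: Callable[[str], int], prompts: list[str], concurrency: int) -> dict:
    """Run prompts through send() with a fixed number of parallel workers.

    send(prompt) must perform one request and return the generated token count.
    """
    def timed(prompt: str) -> tuple[float, int]:
        start = time.perf_counter()
        tokens = send(prompt)
        return time.perf_counter() - start, tokens

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, prompts))
    wall_seconds = time.perf_counter() - wall_start

    latencies = sorted(latency for latency, _ in results)
    total_tokens = sum(tokens for _, tokens in results)
    return {
        "requests": len(results),
        "total_tokens": total_tokens,
        "tokens_per_second": total_tokens / wall_seconds,
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": latencies[-1],
    }


def fake_send(prompt: str) -> int:
    """Stand-in for a real HTTP call: sleeps briefly and reports 10 tokens."""
    time.sleep(0.01)
    return 10


report = run_benchmark(fake_send, ["example prompt"] * 8, concurrency=4)
print(report)
```

Run it at several concurrency levels with prompts of realistic length. The interesting result is how median and worst-case latency move as concurrency rises, not the single best throughput figure.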
The bottom line
Local AI in 2026 is a solved problem at every scale — you just have to match the tool to the job. Reach for Ollama to explore and prototype, vLLM to serve at scale, and Docker Model Runner to keep models inside your container workflow. Decide honestly whether you even need local inference, size your VRAM for real load, and benchmark with your own traffic. Do that and you get the privacy, cost control, and latency that pushed you toward local in the first place.
! Common mistakes to avoid
- ✕ Sizing VRAM for the model weights only.
  ✓ Budget for the KV cache too: it grows with context length and concurrency and is what usually exhausts the GPU.
- ✕ Putting Ollama behind a busy production API.
  ✓ Use Ollama to prototype; serve real concurrency with vLLM, which is built for throughput.
- ✕ Trusting a vendor's hero tokens/sec number.
  ✓ Benchmark with your real prompts, context length, and concurrency before committing hardware.
- ✕ Going local when a hosted API would be cheaper and simpler.
  ✓ Only run local for privacy, predictable high-volume cost, or offline/low-latency needs.
? Frequently asked questions
Is Ollama good enough for production?
For single-user or low-volume internal tools, yes. For serving many concurrent users its throughput collapses — that is what vLLM is built for. Use Ollama to prototype and validate quality.
What makes vLLM faster than naive serving?
Continuous batching and paged attention (PagedAttention) let it pack far more concurrent requests onto a GPU and use KV-cache memory efficiently, dramatically raising throughput.
How much GPU memory do I need?
Roughly the model size at your chosen precision (quantised 4-bit is far more forgiving than full precision) plus headroom for the KV cache, which scales with context length and concurrent users. Measure at your real workload.
Can I use all three tools together?
Yes, and many teams do: Ollama on developer laptops, vLLM behind the production API, and Docker Model Runner to package and version the chosen model consistently across both.
Do these expose an OpenAI-compatible API?
Ollama and vLLM both serve OpenAI-compatible endpoints, and Docker Model Runner exposes one as well, so most existing client code works against a local model with little more than a base-URL change.
Success
Mix and match
Mature teams often use all three: Ollama on developer laptops, vLLM behind the production API, and Docker Model Runner to package and ship the chosen model consistently between them. They are complementary, not mutually exclusive.