
Local AI Engine Choice Matrix (2026)

Ten local AI engines scored across thirteen operational dimensions. The matrix is shaped so that no engine wins every column — different engines win on different axes, and that’s the point. Read it row-first: pick your engine, then scan the row to see where it’s strong, where it’s acceptable, and where it stops working.

Every cell carries a one-line operator-readable caveat naming the assumption. “Limited” is not a slur — it’s an honest label for a real ceiling. Open WebUI and AnythingLLM are frontends that sit on top of other engines, so their hardware-direct columns read “n/a”. vLLM and TensorRT-LLM are NVIDIA + Linux first; their macOS and Apple Silicon columns read “n/a”. Use the matrix to eliminate, not to anoint.

Last reviewed 2026-05-07 · By Fredoline Eruo, Independent Local AI Researcher.

Embed this matrix

Linking to this page from a Reddit comment, GitHub README, or blog post? Use this snippet:

<a href="https://runlocalai.co/resources/local-ai-engine-choice-matrix" rel="noopener">RunLocalAI Local AI Engine Choice Matrix (2026)</a>

License: CC-BY-4.0. Attribution appreciated; screenshots of the matrix are welcome with a link back.

Dimensions scored (13): Beginner friendliness · Production serving · OS support (Windows, macOS, Linux) · Hardware (NVIDIA, AMD, Apple Silicon) · Multi-user concurrent serving · Agent workflows · RAG workflows · Offline / air-gapped use · Maintenance burden (lower is better)

Ollama
Local-first wrapper around llama.cpp. The default on-ramp; not a serving stack.
Beginner friendliness: Excellent. Single installer, `ollama run` works in under a minute.
Production serving: Limited. Single-instance daemon; no native batching, queue, or auth.
Windows: Strong. Native Windows installer; CUDA on supported NVIDIA cards.
macOS: Excellent. First-class on Apple Silicon via Metal.
Linux: Excellent. Standard target; package + systemd unit ship out of the box.
NVIDIA: Strong. CUDA via llama.cpp; works on most consumer + datacenter cards.
AMD: Acceptable. ROCm path lands as upstream stabilizes; rough on RDNA2.
Apple Silicon: Excellent. Metal backend; matches llama.cpp's M-series numbers.
Multi-user serving: Limited. OLLAMA_NUM_PARALLEL helps; no continuous batching.
Agent workflows: Strong. OpenAI-compatible endpoint; works with most agent frameworks.
RAG workflows: Strong. Native embeddings via /api/embeddings; pairs with Open WebUI / AnythingLLM.
Offline / air-gapped: Excellent. Pull once, run forever; manifest pin makes air-gap easy.
Maintenance burden: Excellent. Auto-update or pin; lowest operator burden in the ecosystem.

llama.cpp
The reference cross-platform inference runtime. Source of truth for GGUF.
Beginner friendliness: Acceptable. CLI-first; build flags + quant choices to learn before tok/s.
Production serving: Acceptable. llama-server is fine for small fleets; no native multi-tenant.
Windows: Strong. Native binaries; CUDA, Vulkan, and CPU paths all supported.
macOS: Excellent. Metal backend is upstream-maintained; first-class.
Linux: Excellent. Primary development target.
NVIDIA: Excellent. CUDA + cuBLAS path; broad card coverage.
AMD: Strong. Vulkan + ROCm both work; Vulkan is the most reliable AMD option.
Apple Silicon: Excellent. Metal kernels updated alongside macOS; reference path.
Multi-user serving: Limited. Sequential by default; add llama-swap or LocalAI to multiplex.
Agent workflows: Acceptable. OpenAI shim available; tool-call quality lags Ollama / vLLM.
RAG workflows: Strong. Embedding mode + reranker support; pairs with any vector DB.
Offline / air-gapped: Excellent. Single binary + GGUF file; the canonical air-gap runtime.
Maintenance burden: Strong. Few moving parts; pin commit hash + GGUF and you're done.

vLLM
Production serving for NVIDIA. Continuous batching + paged attention.
Beginner friendliness: Limited. Python + CUDA + driver pinning; not a one-command install.
Production serving: Excellent. Industry default for shared GPU serving; metrics + queue native.
Windows: Limited. WSL2 only; native Windows is uncovered in our corpus.
macOS: n/a. Not supported; CUDA-only runtime.
Linux: Excellent. Linux is the only first-class target.
NVIDIA: Excellent. Built around CUDA + flash-attention; A100/H100/4090 all sing.
AMD: Acceptable. ROCm fork exists; lags upstream by weeks and breaks at major versions.
Apple Silicon: n/a. No Metal backend; Apple Silicon is out of scope.
Multi-user serving: Excellent. Continuous batching + paged KV cache; linear scale to dozens of users.
Agent workflows: Strong. OpenAI-compatible; tool-call works but parsing varies by model.
RAG workflows: Strong. High-throughput retrieval-side serving; embeddings via separate model.
Offline / air-gapped: Strong. Air-gappable but heavy: full Python + CUDA + flash-attention bundle.
Maintenance burden: Limited. Pin Python + CUDA + flash-attention + vLLM versions or it breaks.

SGLang
High-throughput serving with structured output + radix attention. NVIDIA-first.
Beginner friendliness: Limited. Research-grade DX; expect to read source for non-default flows.
Production serving: Strong. Used in production at LMSYS; structured-output story is best-in-class.
Windows: Limited. WSL2 only; native Windows uncovered in our corpus.
macOS: n/a. Not supported; CUDA-first runtime.
Linux: Excellent. Linux + CUDA is the only supported configuration.
NVIDIA: Excellent. Radix attention + flash-attention; competitive with vLLM on throughput.
AMD: Limited. ROCm path uncovered in our corpus; treat as NVIDIA-only.
Apple Silicon: n/a. No Metal backend.
Multi-user serving: Excellent. Built for it; cache-prefix sharing across users is the differentiator.
Agent workflows: Excellent. Native structured outputs + grammar; the strongest agent-runtime fit.
RAG workflows: Strong. Prefix caching pays off when many requests share retrieval context.
Offline / air-gapped: Acceptable. Possible but heavy; Python + CUDA stack must be staged offline.
Maintenance burden: Limited. Younger than vLLM; release cadence is fast and breaking.

TensorRT-LLM
NVIDIA's optimized serving stack. Highest throughput, highest setup cost.
Beginner friendliness: Poor. Engine-build step + CUDA toolkit + container; multi-day onboarding.
Production serving: Excellent. Highest measured tok/s on H100/H200; the datacenter ceiling.
Windows: Limited. WSL2 only; native Windows is not a supported target.
macOS: n/a. NVIDIA-only stack.
Linux: Excellent. Linux + container is the canonical deployment shape.
NVIDIA: Excellent. First-party; only path that fully exploits H100 transformer engine.
AMD: n/a. NVIDIA-only by design.
Apple Silicon: n/a. NVIDIA-only by design.
Multi-user serving: Excellent. Triton inference server integration; built for shared serving.
Agent workflows: Strong. OpenAI-compatible via Triton; structured output via guided decoding.
RAG workflows: Strong. Embedding model serving via separate Triton instance; high QPS.
Offline / air-gapped: Limited. Engine builds are GPU-arch specific; offline rebuilds are painful.
Maintenance burden: Poor. Engine rebuilds on driver/CUDA bumps; the heaviest ops burden of the ten.

MLX
Apple's framework for Apple Silicon. Unified memory + Metal kernels.
Beginner friendliness: Acceptable. pip install + Python; not GUI-friendly but simple for developers.
Production serving: Limited. Single-machine; no native multi-tenant serving layer.
Windows: n/a. Apple Silicon only.
macOS: Excellent. First-class; Apple-maintained framework.
Linux: n/a. Apple Silicon only.
NVIDIA: n/a. Not an NVIDIA target.
AMD: n/a. Not an AMD target.
Apple Silicon: Excellent. Native Metal + unified memory; the M-series reference runtime.
Multi-user serving: Acceptable. Single-user is the design center; multi-user is research territory.
Agent workflows: Acceptable. mlx-lm OpenAI shim works; tool-call ergonomics lag llama.cpp.
RAG workflows: Acceptable. Embeddings via mlx-embeddings; small ecosystem vs llama.cpp.
Offline / air-gapped: Strong. macOS-only but fully air-gappable once weights are pulled.
Maintenance burden: Strong. Apple-managed; macOS major version bumps are the only break point.

ExLlamaV2
Specialist runtime for EXL2-quantized weights on NVIDIA. Throughput-optimized.
Beginner friendliness: Limited. Python + EXL2 quant pipeline; expect to convert weights yourself.
Production serving: Acceptable. TabbyAPI wrapper makes it servable; ecosystem is small.
Windows: Strong. Native Windows + CUDA work; community runs it on consumer 30/40-series.
macOS: n/a. CUDA-only.
Linux: Excellent. Primary development target.
NVIDIA: Excellent. Exploits NVIDIA INT4/INT8 paths; competitive single-stream tok/s.
AMD: n/a. CUDA-only; no ROCm backend.
Apple Silicon: n/a. CUDA-only.
Multi-user serving: Limited. TabbyAPI adds a queue; not a continuous-batching serving stack.
Agent workflows: Acceptable. OpenAI shim via TabbyAPI; works but the agent ecosystem assumes vLLM.
RAG workflows: Limited. No native embedding path; pair with a separate embedder.
Offline / air-gapped: Strong. Self-contained Python + EXL2 weights; air-gap-friendly.
Maintenance burden: Acceptable. Smaller community; CUDA + Python pinning still required.

LM Studio
Desktop GUI app wrapping llama.cpp + MLX. Beginner default for desktop.
Beginner friendliness: Excellent. Drag-drop GGUF, hit run; the most beginner-friendly option.
Production serving: Limited. Single-user desktop app; not a server.
Windows: Excellent. First-class desktop target.
macOS: Excellent. First-class on Apple Silicon; uses MLX backend internally.
Linux: Strong. AppImage available; not the primary target but works.
NVIDIA: Strong. CUDA via llama.cpp; covers most consumer cards.
AMD: Acceptable. Vulkan path via llama.cpp; ROCm coverage trails NVIDIA.
Apple Silicon: Excellent. MLX backend; the default GUI for Apple Silicon users.
Multi-user serving: Limited. Local OpenAI server is single-user; not a serving target.
Agent workflows: Acceptable. OpenAI server endpoint works; tool-call quality is model-dependent.
RAG workflows: Limited. No first-party RAG; pair with AnythingLLM or Open WebUI.
Offline / air-gapped: Strong. Fully offline once models are downloaded.
Maintenance burden: Excellent. GUI updates via the app itself; no version-pinning ritual.

Open WebUI
Self-hosted ChatGPT-style UI. Sits on top of Ollama, OpenAI-compatible APIs, and others.
Beginner friendliness: Strong. Docker compose + browser; not as quick as Ollama alone.
Production serving: Acceptable. Multi-user UI layer; serving is whatever runs underneath.
Windows: Strong. Docker Desktop on Windows is the standard install.
macOS: Strong. Docker Desktop on macOS works; some users run via uv directly.
Linux: Excellent. Native Docker target; the canonical homelab deployment.
NVIDIA: n/a. Not an engine; inherits whatever runtime sits behind it.
AMD: n/a. Not an engine; inherits whatever runtime sits behind it.
Apple Silicon: n/a. Not an engine; inherits whatever runtime sits behind it.
Multi-user serving: Strong. Built-in user accounts + roles; throughput is the engine's problem.
Agent workflows: Strong. Native tools, pipelines, and function-calling UI on top of any backend.
RAG workflows: Excellent. First-class document upload + retrieval; the built-in RAG winner.
Offline / air-gapped: Strong. Self-hosted; air-gappable once images are mirrored.
Maintenance burden: Strong. Docker pull + restart; backup the data volume and you're set.

AnythingLLM
Desktop or self-hosted RAG-first frontend. Sits on top of Ollama, LM Studio, OpenAI APIs.
Beginner friendliness: Strong. Desktop installer plus a Docker option; RAG-first onboarding.
Production serving: Limited. Multi-workspace UI; throughput follows the underlying engine.
Windows: Strong. Native desktop installer; common on Windows homelabs.
macOS: Strong. Native desktop installer; pairs naturally with Ollama or LM Studio.
Linux: Strong. AppImage + Docker both supported.
NVIDIA: n/a. Not an engine; inherits whatever runtime it's pointed at.
AMD: n/a. Not an engine; inherits whatever runtime it's pointed at.
Apple Silicon: n/a. Not an engine; inherits whatever runtime it's pointed at.
Multi-user serving: Acceptable. Workspaces + users in the Docker edition; desktop is single-user.
Agent workflows: Strong. Built-in agent skills + MCP tool support; pairs with any LLM backend.
RAG workflows: Excellent. RAG is the product; vector DB + chunker + reranker out of the box.
Offline / air-gapped: Strong. Self-hosted desktop or Docker; fully offline once configured.
Maintenance burden: Strong. Auto-update on desktop; Docker pull on the server edition.

Which engine on Windows with NVIDIA?

Two viable answers depending on what you’re doing. For solo desktop use, LM Studio is the shortest path — native Windows, drag-drop GGUF, CUDA on every consumer card. For homelab serving where you want an HTTP endpoint, Ollama is the most honest answer; it’s a single Windows installer, the OpenAI-compatible API works with every agent framework, and the maintenance burden is genuinely low. Skip vLLM and SGLang on native Windows — the WSL2 detour doesn’t pay off until you’re actually serving multiple users.
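
If you go the Ollama route, a minimal sketch of the serving path follows. It assumes Ollama is running on its default port (11434) and that a model such as llama3.1 has already been pulled; the model name is illustrative, not a recommendation.

```python
# Minimal sketch: query Ollama's OpenAI-compatible endpoint from Python.
# Assumes Ollama is running locally on its default port and the model named
# below has already been pulled with `ollama pull`.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.1",  # illustrative model name; use whatever you pulled
    messages=[{"role": "user", "content": "Summarize why continuous batching matters."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire format, the same snippet works unchanged against any other engine in the matrix that exposes a /v1 server.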

Which engine for production serving?

vLLM if you have NVIDIA GPUs and need shared inference for tens of users; continuous batching + paged attention is the reason most production stacks converge here. SGLang if your workload is agent-heavy or shares prefixes across requests — radix attention pays off there. TensorRT-LLM is the ceiling on H100-class hardware, but the engine-build step and ops burden mean you should only reach for it when you’ve already proven you need the extra throughput. Ollama is fine for a couple of internal users; it’s not a serving stack.
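
To see what vLLM's batching buys you, here is a minimal sketch using its offline Python API; the model name is a placeholder, and in a real shared deployment you would run the OpenAI-compatible server instead and point clients at it.

```python
# Minimal sketch of vLLM's offline Python API: the engine schedules all prompts
# together via continuous batching rather than running them one at a time.
# Model name is a placeholder; requires a Linux + NVIDIA CUDA environment.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain paged attention in two sentences.",
    "Explain continuous batching in two sentences.",
    "Explain prefix caching in two sentences.",
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```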

Which engine on Apple Silicon?

For most operators, Ollama on macOS is the right answer — it uses Metal under the hood, ships with the right defaults, and stays out of the way. Developers comfortable in Python who want the absolute best M-series numbers should reach for MLX via mlx-lm; the unified-memory architecture is what it’s designed around. LM Studio is the GUI option and uses MLX internally on Apple Silicon, so the runtime story is the same with a friendlier UI. Skip vLLM, SGLang, TensorRT-LLM, and ExLlamaV2 entirely on macOS — they don’t support it.
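
For the mlx-lm path, a minimal sketch looks like the following; the model repo is a placeholder, and generate()'s keyword arguments have shifted between mlx-lm releases, so treat this as a shape rather than a contract.

```python
# Minimal sketch of text generation with mlx-lm on Apple Silicon.
# The model repo is a placeholder (any MLX-converted community model works);
# check your installed mlx-lm version for the exact generate() signature.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")  # placeholder repo
text = generate(
    model,
    tokenizer,
    prompt="Why does unified memory help large-model inference on M-series Macs?",
    max_tokens=200,
)
print(text)
```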

Which engine for agent workloads?

The honest answer is two-tiered. For prototyping or solo agent development, Ollama + your agent framework of choice is the path of least resistance — tool-calling works on most tuned models and the iteration loop is fast. For production agent serving where structured outputs and prefix caching matter, SGLang is the strongest fit; native grammar-constrained decoding and shared-prefix radix attention are the right primitives. vLLM is a competent third option if you’ve already standardized on it elsewhere.
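
Because both tiers expose OpenAI-compatible endpoints, the same tool-calling client code carries across them. A minimal sketch follows; the base URL, model name, and the get_weather function are placeholders, and tool-call quality still depends on the model behind the endpoint.

```python
# Minimal sketch of OpenAI-style tool calling against a local endpoint.
# Works against Ollama, vLLM, or SGLang servers that expose /v1; the URL,
# model name, and the get_weather tool are placeholders for illustration.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3.1",  # placeholder; must be a tool-call-tuned model
    messages=[{"role": "user", "content": "What's the weather in Lagos?"}],
    tools=tools,
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```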

Which engine for RAG?

RAG is a frontend question more than an engine question. Open WebUI with Ollama underneath is the shortest path to a multi-user RAG instance — document upload, chunking, retrieval, and reranking are all built in. AnythingLLM is the equivalent if you want a desktop-first install or workspace-style isolation. For raw retrieval throughput on shared NVIDIA hardware, point either frontend at vLLM or SGLang and run the embedding model on a second instance. The engine you pick for inference matters less than the chunker, embedder, and reranker upstream of it.
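
To make the division of labor concrete, here is a minimal sketch of the retrieval half against Ollama's /api/embeddings endpoint; the endpoint shape and model name are assumptions about a recent Ollama install, and a real deployment would swap the in-memory list for the frontend's vector DB and add a reranker.

```python
# Minimal sketch of the retrieval half of RAG against a local embedding model.
# Assumes a recent Ollama with an embedding model pulled
# (e.g. `ollama pull nomic-embed-text`); a real setup uses a vector DB + reranker.
import numpy as np
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> np.ndarray:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return np.array(r.json()["embedding"])

chunks = [
    "vLLM uses paged attention to share KV-cache memory across requests.",
    "Open WebUI ships document upload, chunking, and retrieval out of the box.",
    "MLX targets Apple Silicon's unified memory with Metal kernels.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

query = embed("Which tool has built-in document retrieval?")
best = max(index, key=lambda item: float(
    np.dot(item[1], query) / (np.linalg.norm(item[1]) * np.linalg.norm(query))))
print("Top chunk:", best[0])
```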

Which engine on AMD?

llama.cpp via Vulkan is the most reliable answer in 2026 — ROCm has matured, but Vulkan still works on more cards more consistently. Ollama inherits that path and is the easier wrapper. vLLM’s ROCm fork exists and is usable on supported MI-series cards but lags upstream. ExLlamaV2, TensorRT-LLM, MLX, and SGLang are not viable AMD targets — treat their AMD cells as hard stops, not roadblocks you can route around.

Next steps

Go deeper with the head-to-head comparison: vLLM vs llama.cpp vs Ollama vs MLX vs LM Studio across eleven operational dimensions.