Local AI runtime compatibility matrix

What actually runs on your hardware. The cross-OS, cross-backend matrix for 25 runtimes — vLLM, llama.cpp, Ollama, MLX, ONNX Runtime, IPEX-LLM, ExLlamaV2, and the rest. Every cell carries an operator caveat, not a checkmark.

The full matrix also scores each runtime on OS (Win / macOS / Linux), Backend (CUDA / ROCm / Apple / Intel), Mobile, Docker, Quant (GGUF / AWQ / GPTQ / FP8), and Fit (Beg. / Prod / Dist); the per-runtime caveats below carry the operator detail.

| Runtime | Caveat | Best for |
| --- | --- | --- |
| Ollama | Single-user, sequential decode. Concurrency tops out at 4-8 requests; no continuous batching. | First install, model swapping, hobby use |
| LM Studio | GUI-first with built-in Hugging Face browser. Not a server; the OpenAI-compatible endpoint is localhost-only by default (client sketch after the table). | Non-CLI users, model exploration |
| vLLM | Linux + NVIDIA-only in practice. An AMD ROCm path exists but lags. Continuous batching + PagedAttention is the production reference. | Production serving, multi-tenant inference |
| SGLang | RadixAttention prefix cache compounds for agent loops with stable system prompts. Beats vLLM at high prefix-cache hit rates. | Agent serving, multi-tenant with stable prompts |
| llama.cpp | The most portable runtime: CPU plus every GPU backend. Layer-split for asymmetric multi-GPU. Lags vLLM at concurrency. | Cross-platform deployment, asymmetric GPU pairs |
| ExLlamaV2 | EXL2 quants are sharper than GGUF at the same size. Single-stream throughput leader on consumer NVIDIA. NVIDIA + AMD; no Apple. | Consumer NVIDIA, max single-stream tok/s |
| TabbyAPI | OpenAI-compatible HTTP server in front of ExLlamaV2. The production-style wrapper for EXL2 deployments. | EXL2-based serving with an API |
| MLX-LM | Apple's first-party runtime. MLX-4bit / MLX-8bit quants only. Runs on M-series unified memory; no Intel Mac path (sketch below). | Apple Silicon Macs, unified-memory deployments |
| MLX Swift | iOS / iPadOS / macOS app-bundled inference. Production-grade for App Store deployments. Same checkpoints as desktop MLX. | iOS app-bundled local LLM inference |
| MLC LLM | TVM-based; compiles models for any GPU via Vulkan / Metal / WebGPU. The cross-platform mobile reference. Compile-time overhead. | Cross-platform mobile + WebGPU |
| TensorRT-LLM | Highest peak throughput on H100 / H200 with the FP8 transformer engine. Recompile-per-config friction is real. | Datacenter NVIDIA at peak throughput |
| Text Generation Inference (TGI) | Hugging Face's serving runtime. Inference Endpoints + TGI is the HF-native production path; community uptake trails vLLM. | HF Inference Endpoints, HF-native deployments |
| Ray Serve | Orchestration layer wrapping vLLM / SGLang / TGI replicas. Scales request throughput, not single-model size. | Multi-replica serving + autoscaling |
| Exo | Multi-Mac / mixed-device clustering over Thunderbolt + LAN. Layer-shards a single model across machines (fit estimator below). | Multi-Mac clusters, >192 GB unified-memory targets |
| Petals | BitTorrent-style distributed inference. Public-swarm mode is research-tier; private-swarm production deployments exist but are rare. | Research, private compute pooling |
| ONNX Runtime | Microsoft's cross-platform inference runtime. Strongest on classical models; LLM-specific optimizations trail vLLM / llama.cpp. | Cross-OS / cross-backend production with one runtime |
| Intel OpenVINO | Intel-only: Arc GPU, Lunar Lake NPU, and AVX-512 CPU paths. The reference runtime for Intel hardware. | Intel Arc + Lunar Lake / Meteor Lake NPU |
| IPEX-LLM | Intel's PyTorch-native LLM runtime. The first-class path for running LLMs on Intel Arc A770/B580 and Lunar Lake NPU; also ships an IPEX-backed Ollama build. | Intel Arc GPU LLM inference |
| CTranslate2 | Specialized transformer runtime. The Whisper (faster-whisper) reference. Encoder-decoder optimization that LLM runtimes don't prioritize. | Whisper, NMT, encoder-decoder inference |
| DirectML | Windows DirectX 12 inference backend. Vendor-agnostic on Windows: AMD / Intel / Qualcomm GPU + NPU without ROCm or vendor SDKs. | Windows multi-vendor GPU + NPU inference |
| llama-cpp-python | Python bindings + OpenAI-compatible server. The fastest path from `pip install` to a working endpoint; the backend is pinned via wheel choice (quickstart below). | Python-first integration, scripting, prototyping |
| Aphrodite Engine | vLLM fork specialized for creative writing / role-play. Adds DRY / XTC / dynatemp samplers that vLLM doesn't ship. | SillyTavern + role-play workloads |
| ExecuTorch | PyTorch's mobile/edge inference runtime. Compiles PyTorch models for Android (NNAPI / GPU / NPU) and iOS (Metal / Core ML). | PyTorch-native mobile deployment |
| Qualcomm AI Hub | Qualcomm's NPU compiler + model zoo. Snapdragon-only. Pre-quantized variants of Llama / Phi / Gemma / Qwen for the Hexagon NPU. | Snapdragon NPU production deployment |
| ONNX Runtime Mobile | Mobile/edge variant of ONNX Runtime. The reference path for Snapdragon X / Lunar Lake / Ryzen AI on Windows + Copilot+ PC NPU. | Copilot+ PC NPU, cross-vendor Windows |
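
A practical consequence of the table: most of the server-shaped runtimes (Ollama, LM Studio, vLLM, TabbyAPI, llama-cpp-python, Aphrodite) speak the OpenAI chat-completions protocol, so one client drives any of them by swapping the base URL. A minimal sketch using the `openai` Python package; the ports shown are each runtime's defaults, and the model name is a placeholder for whatever the server has loaded.

```python
from openai import OpenAI

# Any OpenAI-compatible local endpoint works here; only base_url changes.
# Default ports: Ollama 11434, LM Studio 1234, vLLM 8000 (all under /v1).
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible route
    api_key="not-needed",                  # local servers ignore the key
)

resp = client.chat.completions.create(
    model="llama3.1:8b",  # placeholder: whatever model the server exposes
    messages=[{"role": "user", "content": "One sentence: why continuous batching?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```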
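
The llama-cpp-python row claims the fastest pip-to-endpoint path. A sketch of the in-process API, assuming a local GGUF file (the path is a placeholder); note that backend pinning happens at install time via the wheel you choose (CUDA, Metal, or the default CPU build), not in this code.

```python
from llama_cpp import Llama

# Load a local GGUF checkpoint; n_gpu_layers=-1 offloads every layer
# to whatever GPU backend the installed wheel was built for.
llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one line."}],
    max_tokens=48,
)
print(out["choices"][0]["message"]["content"])
```

Running `python -m llama_cpp.server --model ./models/model-q4_k_m.gguf` turns the same checkpoint into the OpenAI-compatible endpoint the table mentions.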
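
For the Apple-first rows, MLX-LM's Python API is similarly compact. A sketch assuming an MLX-quantized community checkpoint; the repo id is illustrative, not a specific recommendation.

```python
from mlx_lm import load, generate

# Loads an MLX-format checkpoint into unified memory (M-series Macs only).
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")  # illustrative repo id

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one sentence.",
    max_tokens=48,
)
print(text)
```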
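
Several caveats above turn on whether a model fits in memory at all: MLX's unified-memory ceiling, Exo's >192 GB targets, and the GGUF / AWQ / GPTQ / FP8 quant choices. A back-of-envelope fit check; the 1.2x overhead factor and the bits-per-weight figures are rules of thumb I'm assuming, not measurements.

```python
def fits(params_b: float, bits_per_weight: float, ctx: int,
         n_layers: int, n_kv_heads: int, head_dim: int,
         mem_gb: float, kv_bytes: int = 2) -> bool:
    """Rule-of-thumb check: do weights + KV cache fit in VRAM / unified memory?"""
    weights_gb = params_b * bits_per_weight / 8           # B params -> GB
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bytes / 1e9
    return 1.2 * (weights_gb + kv_gb) <= mem_gb           # ~20% runtime overhead (assumed)

# Example: 70B model at ~4.5 bits/weight (Q4-class incl. scales), 8k context,
# a Llama-70B-like shape (80 layers, 8 KV heads, head_dim 128), on 48 GB.
print(fits(70, 4.5, 8192, 80, 8, 128, 48))  # False: ~42 GB * 1.2 > 48 GB
```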

Going deeper