Frontier zone · Inference runtimes

The inference-runtime frontier

What's accelerating in the runtime layer. vLLM remains the production default; SGLang is the architectural challenger; Exo turned consumer Mac clusters into a credible serving option in early 2026. Pair with /maps/inference-runtimes-2026 for the structured-landscape view.

Exploding · Distributed inference
30k · +5k/30d
Exo

The 2026 breakthrough release for consumer-cluster inference. Thunderbolt 5 + macOS 26.2 RDMA cut inter-Mac latency by ~99% on M4 Pro+ hardware. DeepSeek V3 671B running at 5.37 tok/s on 8x M4 Pro Mac Minis is now a credible personal-cluster benchmark, not a tech demo. The architectural shift this represents: consumer hardware can now run frontier-class models locally.

Architecture: Pipeline parallel via MLX over Thunderbolt 5 RDMA. Auto-discovery of nearby Apple Silicon devices. The first credible WAN-or-LAN-cluster inference solution where consumer Mac hardware genuinely competes with datacenter SKUs on tokens-per-watt.
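
A quick sketch of what "serving option" means in practice: each exo node exposes an OpenAI-compatible chat endpoint, so any OpenAI-style client can point at the cluster. The port (52415 is exo's documented default) and the model id below are assumptions; use whatever your own instance advertises.

```python
import requests

# Hypothetical local exo node; 52415 is exo's default API port,
# but verify against your own instance.
resp = requests.post(
    "http://localhost:52415/v1/chat/completions",
    json={
        "model": "deepseek-v3",  # hypothetical id; use a model your cluster lists
        "messages": [{"role": "user", "content": "How many nodes are serving this?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```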

Exploding · Inference runtime
52k · +3k/30d
vLLM

Production-default inference engine. v0.17.1 (March 2026) shipped Model Runner V2 with up to 56% higher throughput on GB200. PagedAttention turned KV-cache efficiency into a 5-24x throughput delta over baselines; the project's discipline through 2024-2026 turned that single innovation into a complete production stack.

Architecture: PagedAttention + continuous batching + prefix caching + chunked prefill. The OpenAI-compatible API on top makes it a drop-in for any team running an OpenAI bill they'd rather not pay.
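
For the offline-batch side, the core Python API is two classes. A minimal sketch; the model id is just an example, and any HF-hosted checkpoint works:

```python
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are on by default; no tuning needed.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model id
params = SamplingParams(temperature=0.7, max_tokens=128)

for out in llm.generate(["Explain PagedAttention in one sentence."], params):
    print(out.outputs[0].text)
```

The server path is `vllm serve <model>`, which stands up the OpenAI-compatible API mentioned above.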

Rising · Inference runtime
14k · +2k/30d
SGLang

The credible architectural alternative to vLLM. RadixAttention's tree-structured KV cache is a real advantage on shared-prefix traffic; the SGL DSL's structured-generation primitives turn 5-10x token efficiency into a defensible feature for any workload that already enforces output structure client-side.

Architecture: Tree-structured KV cache (vs vLLM's flat blocks) + structured-generation DSL. Cross-replica prefix-cache sync makes the architectural advantage compound at multi-node scale.
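
What the DSL buys you, sketched with SGLang's frontend primitives; the backend URL assumes a local `sglang` server on its default port 30000:

```python
import sglang as sgl

@sgl.function
def triage(s, ticket):
    # The shared template prefix below is what RadixAttention caches
    # across calls to this function.
    s += "Ticket: " + ticket + "\n"
    s += "Severity: " + sgl.gen("severity", choices=["low", "medium", "high"]) + "\n"
    s += "Summary: " + sgl.gen("summary", max_tokens=32, stop="\n")

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = triage.run(ticket="Checkout 500s under load since the 14:00 deploy.")
print(state["severity"], "|", state["summary"])
```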

Rising · Apple Silicon
5k · +600/30d
MLX-LM

The LLM runner for MLX, Apple's Metal-native ML framework. Now competitive with llama.cpp's Metal backend on M-series silicon, with stronger long-context performance. The 2026 unlock was Thunderbolt 5 + macOS 26.2 RDMA, which made multi-Mac clusters credible (see Exo).

Architecture: Pure Metal kernels; unified-memory-aware. The MLX quant format is separate from GGUF, which is the main compatibility gap.
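
The API surface is deliberately small. A minimal sketch; the 4-bit model id is an example from the mlx-community namespace:

```python
from mlx_lm import load, generate

# Weights load straight into unified memory; no device placement to manage.
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
print(generate(model, tokenizer, prompt="Unified memory matters because",
               max_tokens=64))
```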

Rising · ROCm tooling
6k · +350/30d
ROCm

AMD's CUDA equivalent. ROCm 6.2+ matured through 2025; the gap with CUDA is narrowing on the headline LLaMA / Mistral / Qwen architectures. RX 7900 XTX on ROCm runs Llama 3.1 8B Q4_K_M at ~86 tok/s — within 17% of RTX 4090. The trajectory matters: AMD viability for local AI improved more in 2025-2026 than in any prior 18-month period.

Architecture: Kernel coverage trails CUDA; some attention variants regress. Verify your model's specific architecture has a working ROCm path before committing.
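
A quick smoke test, assuming a ROCm build of PyTorch; HIP sits behind the familiar `torch.cuda` API, so the usual checks work unchanged. This verifies the basic kernel path only, not every attention variant:

```python
import torch

# On ROCm builds, torch.version.hip is set and torch.version.cuda is None.
print("HIP build:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(512, 512, device="cuda")  # a HIP device, despite the name
    print("matmul finite:", torch.isfinite((x @ x).sum()).item())
```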

Stable · Inference runtime
132k · +2k/30d
Ollama

The default first-pull tool for every newcomer to local AI. The curated model library and zero-config setup beat every alternative on time-to-first-token. Mature; the project's ergonomic moat is genuine — most chat-model users never need anything more.
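
The zero-config claim, concretely: after `ollama pull llama3.1`, the local REST API (default port 11434) is already up. A minimal sketch:

```python
import requests

# Ollama's local REST API; no auth, no config, one endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "KV cache in one line:", "stream": False},
)
print(resp.json()["response"])
```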

Stable · Inference runtime
92k · +2k/30d
llama.cpp

The bedrock most other runtimes sit on. Ollama wraps it; LM Studio bundles it; Llamafile ships it as one binary. Every quant kernel improvement propagates to all of them. Mature; no architectural breaks expected — the project's value is steady kernel-level progress.

Architecture: C++ inference engine with first-class GGUF format. Every consumer-tier local AI runtime that isn't MLX or ExLlamaV2 is a wrapper around this.
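
One of those wrappers in miniature: llama-cpp-python binds the same engine directly. A sketch, assuming a local GGUF file; the path is hypothetical:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_ctx=4096,
    n_gpu_layers=-1,  # offload everything the backend supports
)
out = llm("Q: What is GGUF? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```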

Stable · Quantization
6k · +220/30d
ExLlamaV2

GPU-only inference library optimized for consumer NVIDIA cards. Fastest tokens-per-second on a single 24GB card for 30B-class models in EXL2 quant. Stable; the EXL2 ecosystem is narrower than GGUF but the speed advantage is real for committed users.
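
The load-and-generate path, sketched against ExLlamaV2's dynamic generator; the model directory is hypothetical and must hold EXL2-quantized weights:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/models/llama-30b-exl2-4.0bpw")  # hypothetical dir
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # fill the 24GB card layer by layer

generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache, tokenizer=ExLlamaV2Tokenizer(config)
)
print(generator.generate(prompt="EXL2 in one sentence:", max_new_tokens=48))
```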

Stable · Distributed inference
10k · +200/30d
Petals

BitTorrent-style decentralized LLM inference. The architectural extreme: 'internet is the cluster.' ~6 tok/s on Llama-2 70B in the public swarm; viable when you can't fit the model anywhere and don't have a GPU cluster. Mature; growth is steady but no longer explosive — Exo's rise has shifted distributed-inference attention to controlled clusters.
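
Joining the swarm is a one-class swap from standard transformers. A sketch using a model the public swarm has historically hosted; availability varies:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

name = "petals-team/StableBeluga2"  # swarm-hosted at time of writing
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoDistributedModelForCausalLM.from_pretrained(name)  # joins the swarm

# Layers you don't hold locally are served by remote peers.
inputs = tokenizer("Decentralized inference works because", return_tensors="pt")
print(tokenizer.decode(model.generate(inputs["input_ids"], max_new_tokens=16)[0]))
```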

Cooling · Inference runtime
10k · +80/30d
Text Generation Inference (TGI)

The 2023-2024 production default; vLLM ate that lunch through 2024-2025. TGI still has tighter HF Hub integration and a slightly nicer ops surface, but new deployments default to vLLM unless HF Hub integration specifically matters. The ecosystem has shifted; momentum has moved to the alternatives.

Going deeper