Operational comparisons
Editorial

Compare local AI options

Every comparison surface here scores its options along the dimensions that matter when an operator has to live with the decision: maintenance burden, reproducibility, lock-in risk, privacy, offline capability, operational complexity, benchmark freshness, and trust coverage.

We publish these because no inference engine vendor will ever compare itself neutrally to its competitors. Our domain is the analytical layer above any single runtime, model, or marketplace — that's the only place neutral comparison can live.

Runtimes — vLLM vs llama.cpp vs Ollama vs MLX

Cross-engine comparison on the operational dimensions that matter: maintenance burden, reproducibility, OS support, lock-in, observability.

Quantization tiers — FP16 vs Q8 vs Q5 vs Q4 vs Q3

Speed, memory, and quality tradeoffs across quantization bit-width tiers. Acknowledges what each tier gives up.
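To make the memory side of that tradeoff concrete, here is a minimal sketch of the footprint arithmetic behind a tier comparison. The bits-per-weight figures and the 20% overhead factor for KV cache and runtime buffers are placeholder assumptions, not measured values; real quant formats mix block scales and sub-formats.

```python
# Rough weight-memory estimate per quantization tier.
# BITS_PER_WEIGHT values and the overhead factor are illustrative assumptions.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8.5, "Q5": 5.5, "Q4": 4.5, "Q3": 3.5}

def estimated_vram_gb(params_billion: float, tier: str, overhead: float = 0.20) -> float:
    weights_gb = params_billion * BITS_PER_WEIGHT[tier] / 8  # billions of bytes = GB
    return weights_gb * (1 + overhead)                       # KV cache + buffers (assumed 20%)

if __name__ == "__main__":
    for tier in BITS_PER_WEIGHT:
        print(f"70B @ {tier}: ~{estimated_vram_gb(70, tier):.0f} GB")
```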

Hardware tiers — laptop vs workstation vs homelab vs rack

What each tier can run, what breaks first, what it costs to operate. Acknowledges thermal + power realism.

Local vs cloud inference

Privacy, latency, lock-in, predictable cost, offline capability. Honest about cloud being faster on raw speed.

Operator total cost of ownership

3-year amortized cost of running local AI: hardware + electricity + downtime + operator hours. Cloud break-even points.
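A hedged sketch of that amortization arithmetic, in Python. Every figure below (hardware price, average draw, electricity rate, operator hours, cloud token pricing) is a placeholder assumption to be replaced with your own numbers; the point is the shape of the calculation, not the values.

```python
# 3-year local TCO and a simple cloud break-even estimate.
# All inputs are placeholder assumptions.
def local_tco_3yr(hardware_usd, watts_avg, usd_per_kwh,
                  operator_hours_per_month, operator_rate_usd,
                  downtime_cost_usd_per_year=0.0):
    electricity = watts_avg / 1000 * 24 * 365 * 3 * usd_per_kwh
    operator = operator_hours_per_month * 12 * 3 * operator_rate_usd
    return hardware_usd + electricity + operator + downtime_cost_usd_per_year * 3

def cloud_break_even_tokens(local_tco_usd, cloud_usd_per_million_tokens):
    """Monthly token volume (millions) at which 3 years of cloud spend equals local TCO."""
    return local_tco_usd / cloud_usd_per_million_tokens / 36

if __name__ == "__main__":
    tco = local_tco_3yr(hardware_usd=3500, watts_avg=350, usd_per_kwh=0.15,
                        operator_hours_per_month=4, operator_rate_usd=60)
    print(f"3-year local TCO: ~${tco:,.0f}")
    print(f"Break-even at ~{cloud_break_even_tokens(tco, 5.0):.1f}M tokens/month (at $5/M)")
```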

Engine vs engine — head-to-head runtime pairs

Direct one-on-one comparisons: vLLM vs SGLang, Ollama vs llama.cpp, MLX vs llama.cpp, TensorRT-LLM vs vLLM, and more.

Hardware vs hardware — head-to-head GPU pairs

Direct buyer comparisons: RTX 4090 vs 5090, dual 3090 vs 5090, M4 Max vs 4090, RX 7900 XTX vs 4090. Operator-grade tradeoffs.

Build your own hardware comparison

Pick any two cards from the catalog and get a side-by-side decision card with effective-VRAM math, CUDA-vs-ROCm caveats, and used-market notes.
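For readers who want to sanity-check the builder's output, here is a minimal sketch of one plausible effective-VRAM calculation. The per-card reserve and the multi-GPU penalty are assumptions, not the catalog's actual constants; driver and compositor overhead varies by OS and setup.

```python
# Effective-VRAM check behind a side-by-side card (illustrative constants).
def effective_vram_gb(cards_gb, reserve_gb_per_card=1.0, multi_gpu_penalty=0.05):
    usable = sum(gb - reserve_gb_per_card for gb in cards_gb)
    if len(cards_gb) > 1:
        usable *= 1 - multi_gpu_penalty  # tensor-parallel duplication, assumed 5%
    return usable

def fits(model_footprint_gb, cards_gb):
    return model_footprint_gb <= effective_vram_gb(cards_gb)

# Example: dual 24 GB cards vs a single 32 GB card for a ~40 GB footprint.
print(fits(40, [24, 24]))  # True  (~43.7 GB effective)
print(fits(40, [32]))      # False (~31.0 GB effective)
```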

How we score

Every cell uses one of five qualitative tiers — excellent, strong, acceptable, limited, poor — with a one-line caveat that names the assumption. We never render all-green for any option; if a runtime wins on speed, the comparison surfaces what it costs you in operator hours.

Tiers are editorial. The underlying benchmark numbers come from the public corpus (editorial + reproduced community submissions). When the corpus is too thin to produce a confident tier, we render “n/a” and link to the benchmark roadmap so contributors know where to fill the gap.
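A sketch of how a scored cell could be represented in code: a tier plus the mandatory one-line caveat, with an explicit "n/a" fallback when the corpus is too thin. The class and field names are illustrative, not the site's actual schema.

```python
# Illustrative data structure for a scored comparison cell.
from dataclasses import dataclass
from typing import Optional

TIERS = ("excellent", "strong", "acceptable", "limited", "poor")

@dataclass
class ScoredCell:
    tier: Optional[str]  # one of TIERS, or None when the corpus is too thin
    caveat: str          # one line naming the assumption behind the tier

    def render(self) -> str:
        if self.tier is None:
            return "n/a (see benchmark roadmap)"
        assert self.tier in TIERS
        return f"{self.tier} ({self.caveat})"

print(ScoredCell("strong", "assumes single-node serving").render())
print(ScoredCell(None, "").render())
```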

When to use each comparison surface

The comparisons above answer different operator questions. Picking the wrong surface wastes your time. Hardware vs hardware is the right page when you've narrowed to two specific GPUs and want the buyer-grade tradeoff. Engine vs engine (vLLM vs SGLang, Ollama vs llama.cpp) is the right page when your hardware is decided and you're choosing how to serve. Quantization comparison is the right page when both your hardware and your runtime are decided and you're tuning model fit at your target context length.

The local-vs-cloud surface is the most-asked but least decisive — it answers an “is this even worth it?” question that the rest of the site assumes you've already answered. Use it for sanity checks, or share it with someone trying to convince stakeholders that local AI is operationally feasible.

What you won't find on these pages

No “ultimate winner” verdicts. Most local-AI decisions are configuration-dependent — what wins for a single-user homelab loses for a 50-RPS production deployment, and vice versa. The comparisons surface the tradeoff dimensions and let you weigh them against your own constraints. The /will-it-run engine and the buyer-guide cluster are where the “what should I actually buy” question gets its operator-grade answer; the comparison pages are the input to that decision, not the output.