Operational comparisons
Editorial

Compare local AI options

Every comparison surface here scores its options along the dimensions that matter when an operator has to live with the decision: maintenance burden, reproducibility, lock-in risk, privacy, offline capability, operational complexity, benchmark freshness, and trust coverage.

We publish these because no inference engine vendor will ever compare itself neutrally to its competitors. Our domain is the analytical layer above any single runtime, model, or marketplace — that's the only place neutral comparison can live.

Runtimes — vLLM vs llama.cpp vs Ollama vs MLX

Cross-engine comparison on the operational dimensions that matter: maintenance burden, reproducibility, OS support, lock-in, observability.

Quantization tiers — FP16 vs Q8 vs Q5 vs Q4 vs Q3

Speed, memory, and quality tradeoffs across quantization bit-width tiers. Acknowledges what each tier gives up.
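To make the memory side of that tradeoff concrete, here is a minimal sketch of the footprint arithmetic behind a tier comparison. The bits-per-weight figures and the 20% overhead factor for KV cache and runtime buffers are placeholder assumptions, not measured values; real quant formats mix block scales and sub-formats.

```python
# Rough weight-memory estimate per quantization tier.
# BITS_PER_WEIGHT values and the overhead factor are illustrative assumptions.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8.5, "Q5": 5.5, "Q4": 4.5, "Q3": 3.5}

def estimated_vram_gb(params_billion: float, tier: str, overhead: float = 0.20) -> float:
    weights_gb = params_billion * BITS_PER_WEIGHT[tier] / 8  # billions of bytes = GB
    return weights_gb * (1 + overhead)                       # KV cache + buffers (assumed 20%)

if __name__ == "__main__":
    for tier in BITS_PER_WEIGHT:
        print(f"70B @ {tier}: ~{estimated_vram_gb(70, tier):.0f} GB")
```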

Hardware tiers — laptop vs workstation vs homelab vs rack

What each tier can run, what breaks first, what it costs to operate. Acknowledges thermal + power realism.

Local vs cloud inference

Privacy, latency, lock-in, predictable cost, offline capability. Honest about cloud being faster on raw speed.

Operator total cost of ownership

3-year amortized cost of running local AI: hardware + electricity + downtime + operator hours. Cloud break-even points.
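A hedged sketch of that amortization arithmetic, in Python. Every figure below (hardware price, average draw, electricity rate, operator hours, cloud token pricing) is a placeholder assumption to be replaced with your own numbers; the point is the shape of the calculation, not the values.

```python
# 3-year local TCO and a simple cloud break-even estimate.
# All inputs are placeholder assumptions.
def local_tco_3yr(hardware_usd, watts_avg, usd_per_kwh,
                  operator_hours_per_month, operator_rate_usd,
                  downtime_cost_usd_per_year=0.0):
    electricity = watts_avg / 1000 * 24 * 365 * 3 * usd_per_kwh
    operator = operator_hours_per_month * 12 * 3 * operator_rate_usd
    return hardware_usd + electricity + operator + downtime_cost_usd_per_year * 3

def cloud_break_even_tokens(local_tco_usd, cloud_usd_per_million_tokens):
    """Monthly token volume (millions) at which 3 years of cloud spend equals local TCO."""
    return local_tco_usd / cloud_usd_per_million_tokens / 36

if __name__ == "__main__":
    tco = local_tco_3yr(hardware_usd=3500, watts_avg=350, usd_per_kwh=0.15,
                        operator_hours_per_month=4, operator_rate_usd=60)
    print(f"3-year local TCO: ~${tco:,.0f}")
    print(f"Break-even at ~{cloud_break_even_tokens(tco, 5.0):.1f}M tokens/month (at $5/M)")
```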

Engine vs engine — head-to-head runtime pairs

Direct one-on-one comparisons: vLLM vs SGLang, Ollama vs llama.cpp, MLX vs llama.cpp, TensorRT-LLM vs vLLM, and more.

Hardware vs hardware — head-to-head GPU pairs

Direct buyer comparisons: RTX 4090 vs 5090, dual 3090 vs 5090, M4 Max vs 4090, RX 7900 XTX vs 4090. Operator-grade tradeoffs.

Build your own hardware comparison

Pick any two cards from the catalog and get a side-by-side decision card with effective-VRAM math, CUDA-vs-ROCm caveats, and used-market notes.
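For readers who want to sanity-check the builder's output, here is a minimal sketch of one plausible effective-VRAM calculation. The per-card reserve and the multi-GPU penalty are assumptions, not the catalog's actual constants; driver and compositor overhead varies by OS and setup.

```python
# Effective-VRAM check behind a side-by-side card (illustrative constants).
def effective_vram_gb(cards_gb, reserve_gb_per_card=1.0, multi_gpu_penalty=0.05):
    usable = sum(gb - reserve_gb_per_card for gb in cards_gb)
    if len(cards_gb) > 1:
        usable *= 1 - multi_gpu_penalty  # tensor-parallel duplication, assumed 5%
    return usable

def fits(model_footprint_gb, cards_gb):
    return model_footprint_gb <= effective_vram_gb(cards_gb)

# Example: dual 24 GB cards vs a single 32 GB card for a ~40 GB footprint.
print(fits(40, [24, 24]))  # True  (~43.7 GB effective)
print(fits(40, [32]))      # False (~31.0 GB effective)
```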

How we score

Every cell uses one of five qualitative tiers — excellent, strong, acceptable, limited, poor — with a one-line caveat that names the assumption. We never render all-green for any option; if a runtime wins on speed, the comparison surfaces what it costs you in operator hours.

Tiers are editorial. The underlying benchmark numbers come from the public corpus (editorial + reproduced community submissions). When the corpus is too thin to produce a confident tier, we render “n/a” and link to the benchmark roadmap so contributors know where to fill the gap.
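A sketch of how a scored cell could be represented in code: a tier plus the mandatory one-line caveat, with an explicit "n/a" fallback when the corpus is too thin. The class and field names are illustrative, not the site's actual schema.

```python
# Illustrative data structure for a scored comparison cell.
from dataclasses import dataclass
from typing import Optional

TIERS = ("excellent", "strong", "acceptable", "limited", "poor")

@dataclass
class ScoredCell:
    tier: Optional[str]  # one of TIERS, or None when the corpus is too thin
    caveat: str          # one line naming the assumption behind the tier

    def render(self) -> str:
        if self.tier is None:
            return "n/a (see benchmark roadmap)"
        assert self.tier in TIERS
        return f"{self.tier} ({self.caveat})"

print(ScoredCell("strong", "assumes single-node serving").render())
print(ScoredCell(None, "").render())
```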

When to use each comparison surface

The comparisons above answer different operator questions. Picking the wrong surface wastes your time. Hardware vs hardware is the right page when you've narrowed to two specific GPUs and want the buyer-grade tradeoff. Engine vs engine (vLLM vs SGLang, Ollama vs llama.cpp) is the right page when your hardware is decided and you're choosing how to serve. Quantization comparison is the right page when both your hardware and your runtime are decided and you're tuning model fit at your target context length.

The local-vs-cloud surface is the most-asked but least decisive — it answers an “is this even worth it?” question that the rest of the site assumes you've already answered. Use it for sanity checks, or share it with someone trying to convince stakeholders that local AI is operationally feasible.

What you won't find on these pages

No “ultimate winner” verdicts. Most local-AI decisions are configuration-dependent — what wins for a single-user homelab loses for a 50-RPS production deployment, and vice versa. The comparisons surface the tradeoff dimensions and let you weigh them against your own constraints. The /will-it-run engine and the buyer-guide cluster are where the “what should I actually buy” question gets its operator-grade answer; the comparison pages are the input to that decision, not the output.