Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools it (Apple unified memory). We always explain why effective VRAM is lower than the total.
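For the curious, here is a minimal sketch of that rule in Python. The flat per-card overhead and the unified-memory OS reserve are illustrative assumptions, not the calculator's exact formula.

```python
# Illustrative sketch only: the overhead figures are assumptions,
# not the calculator's actual model.
def effective_vram_gb(cards_gb, unified_memory=False,
                      per_card_overhead_gb=1.0, os_reserve_gb=8.0):
    """Usable VRAM for one model, rather than the naive sum of the cards."""
    if not cards_gb:
        return 0.0
    if unified_memory:
        # Apple unified memory genuinely pools, but the OS keeps a share.
        return max(sum(cards_gb) - os_reserve_gb, 0.0)
    # Discrete multi-GPU: tensor/layer splits are bounded by the smallest
    # card, and every card loses some VRAM to driver context and KV cache.
    usable = [g - per_card_overhead_gb for g in cards_gb]
    return max(min(usable), 0.0) * len(usable)

print(effective_vram_gb([24, 24]))                    # 2x RTX 3090 -> 46.0, not "48 GB"
print(effective_vram_gb([192], unified_memory=True))  # Mac Studio -> 184.0
```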
Each preset reflects a build a real operator would actually own. Click to load — then tweak any field below. Don’t know which GPU yet? Run the GPU chooser first to narrow the field by budget, OS, and workload.
A used RTX 4060 Ti 16 GB or RTX 4070 Ti Super, modest CPU, 32 GB system RAM, Linux. The cheapest realistic local-AI starter that runs 13B-class models.
The canonical 70B-on-a-budget setup. Two used RTX 3090 cards (24 GB each), modern high-end CPU, 128 GB RAM, Ubuntu 24.04, vLLM or ExLlamaV2.
The workstation default for serious solo-user local AI. 24 GB VRAM, 64 GB system RAM, modern high-end CPU, NVMe storage. Runs 32B-class models.
32 GB VRAM, 1.79 TB/s memory bandwidth, native FP4 acceleration. The 2026 next-gen consumer flagship. Comfortably runs 32B-class models at FP16.
MacBook Pro M4 Max with 64-128 GB unified memory, MLX-LM as the engine. Battery-aware single-machine inference for 32B-class models.
The only realistic single-machine path to 70B FP16 outside a datacenter. 192 GB unified memory, near-silent operation, MLX-LM as the canonical engine.
Ubuntu 24.04 + ROCm 6.x + RX 7900 XTX (24 GB). The cheapest 24 GB VRAM AMD path; pairs with llama.cpp HIPBLAS for the most reliable AMD inference.
RTX 4070 Ti Super + Windows 11 + LM Studio. The smoothest possible introduction to local AI on Windows — no compilation, no driver wrestling.
RTX 4090 + 64 GB RAM + Ubuntu 24.04 + vLLM serving Qwen 2.5 Coder 32B AWQ-INT4 at 32K context. The reference autonomous-coding-agent setup.
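A minimal sketch of that serving setup using vLLM's offline Python API; the Hugging Face repo ID and the memory-utilization figure are assumptions, since the preset only pins the model family, quantization, and context length.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # assumed repo ID
    quantization="awq",
    max_model_len=32768,            # the preset's 32K context
    gpu_memory_utilization=0.92,    # assumption: leave headroom on the 24 GB 4090
)

outputs = llm.generate(
    ["Write a Python function that reverses a singly linked list."],
    SamplingParams(temperature=0.2, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```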
Single-user document-search and Q&A on a 4090. Qwen 2.5 14B + nomic-embed-text + Qdrant in Docker. Fits documents of arbitrary size.
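A minimal sketch of the retrieval path, assuming Qdrant is reachable on its default Docker port and that nomic-embed-text (768-dim) is served through Ollama; the serving choice, collection name, and sample text are placeholders.

```python
import ollama
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

def embed(text: str) -> list[float]:
    # Assumes `ollama pull nomic-embed-text` has already been run.
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

client = QdrantClient(url="http://localhost:6333")  # Qdrant in Docker, default port
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
client.upsert("docs", points=[
    PointStruct(id=1,
                vector=embed("Effective VRAM is lower than total VRAM."),
                payload={"text": "Effective VRAM is lower than total VRAM."}),
])
hits = client.search("docs", query_vector=embed("why is effective VRAM lower?"), limit=3)
print(hits[0].payload["text"])
```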
An asymmetric build for VRAM-rich experimentation. llama.cpp layer-split with --tensor-split distributes layers in proportion to each card's VRAM. Not a clean production setup.
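A minimal sketch of the same split via llama-cpp-python, whose tensor_split argument mirrors the --tensor-split flag; the model path and the 24/12 ratio are placeholders, not part of the preset.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,           # offload all layers to the GPUs
    tensor_split=[24, 12],     # split in proportion to each card's VRAM
    n_ctx=8192,
)
print(llm("Explain why the smallest card limits an asymmetric split:",
          max_tokens=128)["choices"][0]["text"])
```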
Four used RTX 3090s on a server motherboard, Ubuntu 24.04 + vLLM. 96 GB aggregate VRAM with tensor parallelism for 70B AWQ plus concurrent users.
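A minimal sketch of the tensor-parallel serving side, again through vLLM's Python API; the 70B AWQ repo ID and context length are assumptions.

```python
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # assumed repo ID
    quantization="awq",
    tensor_parallel_size=4,    # shard the weights across the four RTX 3090s
    max_model_len=16384,       # assumption; tune to fit the KV cache
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```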
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
No GPU slots — pick one below or add multiple slots for mixed-GPU builds. Leave empty for CPU-only inference.