Best hardware for running local AI models — by budget tier
The actually-useful local-AI buying ladder. Six tiers from $0 to $4000+, each with what models fit comfortably, the common gotchas, and links to the specific hardware pages where they exist. Figures are ranges, not fake-precise benchmarks.
How to read this guide
Hardware buying guides for local AI typically over-promise on the lowest tier and under-explain the upper tiers. This one tries to do the opposite. For each tier we tell you (1) what models fit comfortably, not what fits with desperate quantization, (2) what the experience actually feels like, (3) the most common operator mistake at that tier, and (4) when to skip it for the next one.
Two cross-cutting numbers matter more than anything else: VRAM capacity (the gate — if the model doesn't fit, it doesn't run usefully) and memory bandwidth (the speed limit — once a model fits, generation tok/s scales roughly linearly with bandwidth). Compute (TFLOPS) matters far less than either for inference. Keep this in mind when sales pages emphasize TFLOPS at the expense of memory.
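To make that concrete, here is a back-of-the-envelope sketch of the bandwidth ceiling on generation speed. The numbers and the 50-80% real-world factor are rough assumptions for illustration, not benchmarks.

```python
# Rough ceiling on decode speed: every generated token streams the full set of
# active weights through memory, so bandwidth / model size bounds tok/s.
def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Bandwidth-imposed upper bound; real results usually land at roughly
    50-80% of this, depending on runtime and quantization format."""
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb

print(decode_ceiling_tok_s(8, 4.5, 100))   # ~22 tok/s ceiling on a ~100 GB/s laptop
print(decode_ceiling_tok_s(8, 4.5, 936))   # ~208 tok/s ceiling on an RTX 3090
```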
If you want a per-card deep dive after reading this, our choosing a GPU in 2026 guide covers individual SKUs with more granularity. For specific build comparisons, see dual 3090 vs single 5090 and multi-GPU in 2026.
Tier 0 — $0, the laptop you already own
Hardware: any laptop or desktop made since ~2019 with 8 GB+ RAM. No discrete GPU required.
What runs comfortably: Phi-4 Mini, Llama 3.2 3B, Qwen 2.5 3B at Q4 quantization. With 16 GB RAM, Qwen 2.5 7B / Llama 3.1 8B Q4 are usable. Apple Silicon laptops punch well above their weight here because of unified memory bandwidth — an M2 MacBook Air with 16 GB runs 7-8B models at 20-40 tok/s; a same-RAM Intel laptop runs them at 5-15 tok/s.
What it feels like: usable for chat and short-form tasks. Long prompts (5K+ tokens) introduce a multi-second time-to-first-token. Image generation is impractical without a GPU.
Operator gotcha: trying to run a 13-14B model on 16 GB RAM. It fits in memory but leaves nothing for the OS and application working set; you swap; tok/s collapses. Stay at 7-8B class on 16 GB; jump to 14B only when you have a discrete GPU or 32 GB+ RAM.
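A quick sketch of why that happens; the OS/app overhead and KV-cache figures are assumptions for illustration, not measurements.

```python
# Headroom left after loading a quantized model into system RAM on a laptop.
def ram_headroom_gb(params_billion: float, bits_per_weight: float, ram_gb: float,
                    kv_cache_gb: float = 1.0, os_and_apps_gb: float = 6.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return ram_gb - (weights_gb + kv_cache_gb + os_and_apps_gb)

print(ram_headroom_gb(8, 4.5, 16))    # ~4.5 GB spare -> comfortable
print(ram_headroom_gb(14, 4.5, 16))   # ~1.1 GB spare -> one browser tab from swapping
```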
Skip this tier if: you want any image generation, agent loops, or sub-second latency. Move to Tier 1.
Tier 1 — $300-500, the used 3060/3070 entry
Hardware: a used RTX 3060 12 GB ($200-280), used RTX 3070 8 GB ($250-350), or new RX 7600 XT 16 GB ($300-350). Prices assume the rest of the PC already exists; if not, add a $300-500 used desktop and an $80-150 PSU upgrade. The 3060 12 GB is the dominant pick at this tier — its 12 GB of VRAM unlocks materially more than the 3070's 8 GB despite the 3070's higher bandwidth.
What runs comfortably: 7-8B models at Q5/Q6 with usable context, 14B at Q4 with moderate context, and 32B at Q4 only with partial CPU offload and severe speed and context limits. Image generation (SDXL, Stable Diffusion 3 Medium) at 1-3 seconds per image.
What it feels like: the first tier where local AI replaces a paid chat subscription for the user's casual workload. 30-80 tok/s on the right model size.
Operator gotcha: buying an 8 GB card “to start.” The VRAM gap from 8 to 12 GB unlocks 14B-class models and matters more than any speed difference between the two cards; the arithmetic is sketched below. Always pick the higher-VRAM card when the choice is between them at the same price.
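The same arithmetic as above, GPU-side, shows why the 12 GB card is the pick (figures are rough assumptions):

```python
# A 14B model at ~4.5 bits/weight, plus KV cache and runtime buffers.
weights_gb = 14 * 4.5 / 8        # ~7.9 GB of weights
overhead_gb = 1.5                # assumed KV cache for moderate context + CUDA buffers
print(weights_gb + overhead_gb)  # ~9.4 GB: fits on a 12 GB card, hopeless on 8 GB
```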
Skip this tier if: you specifically need 32B or 70B class models, or you do a lot of long-context work. The 12 GB ceiling will frustrate you fast. Move to Tier 2.
Tier 2 — $700-1000, the used 3090 sweet spot
Hardware: a used RTX 3090 24 GB at $700-900 is the dominant pick — the price-per-VRAM-GB champion. Alternative: used RTX 3090 Ti at $1,000-1,200 (5-10% faster), or new RX 7900 XTX 24 GB at $850-950 if you're committed to Linux + ROCm.
What runs comfortably: 32B AWQ / Q4 with usable context. 70B only at aggressive ~2-bit quants on-GPU (quality suffers), or at Q4 with some layers offloaded to system RAM (workable, but slow). MoE models (Qwen 30B-A3B class) run beautifully here. Image generation at full SDXL quality, 1-second iteration. Coding agents (Aider, Continue.dev) become practical.
What it feels like: the tier where local AI genuinely replaces ChatGPT Plus for most non-frontier tasks. 25-60 tok/s on 32B-class with time-to-first-token under 1 second; 70B with CPU offload drops to low single-digit tok/s.
Operator gotcha: undersizing the PSU. The RTX 3090 transient power spikes can hit 500 W+ briefly under load; a 650 W PSU on a full system will trigger intermittent shutdowns. Buy 850 W+ Gold-rated for any single-3090 build, 1000 W+ for dual. The other gotcha: cooling. Used 3090s with poor thermal-pad health throttle silently after 6-12 months; budget for repaste / thermal-pad replacement at year 1.
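A rough sizing sketch for the single-3090 case; all wattages here are assumptions, not measurements of your specific parts:

```python
gpu_transient_w = 500    # assumed brief RTX 3090 spikes above its 350 W TDP
rest_of_system_w = 200   # assumed CPU, fans, drives, motherboard under load
margin = 1.2             # headroom so the PSU isn't running at its limit
print((gpu_transient_w + rest_of_system_w) * margin)  # ~840 W -> the 850 W+ Gold advice checks out
```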
Skip this tier if: you want a quiet machine, you specifically need to run Llama 70B or Qwen 72B at FP16, or you want to do serious fine-tuning. The 3090's 350 W TDP and fan acoustics are real. Move to Tier 4 or 5.
Tier 3 — $1500-2000, the used 4090 step up
Hardware: a used RTX 4090 24 GB at $1,400-1,700 (the price hasn't dropped much because of sustained AI demand), or a new RTX 5080 16 GB at $1,200-1,400 (less VRAM, much newer architecture, FP8 support).
What runs comfortably: the same models as Tier 2, but 30-50% faster. The big practical win is FP8 support on Ada — for users who want vLLM-served models with near-FP16 quality, the 4090 unlocks FP8 quants the 3090 cannot run efficiently.
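As a sketch of what that looks like in practice, here is a minimal vLLM launch with FP8 quantization. The model id is a placeholder, and FP8 availability depends on your vLLM version and on the model's published quants.

```python
from vllm import LLM, SamplingParams

# Minimal FP8 serving sketch for an Ada-class GPU (placeholder model id).
llm = LLM(
    model="your-org/your-32b-model",  # hypothetical; substitute a real checkpoint
    quantization="fp8",               # FP8 weights/activations on Ada and newer
    max_model_len=8192,
)
outputs = llm.generate(["Explain the KV cache in one paragraph."],
                       SamplingParams(max_tokens=200))
print(outputs[0].outputs[0].text)
```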
What it feels like: the same capability as Tier 2 with meaningfully more headroom. The kind of tier where the bottleneck shifts from “does this work?” to “what should I do with this?”
Operator gotcha: the 4090's 12VHPWR connector. Make sure the cable is fully seated; poorly seated and aftermarket cables caused widely reported melting incidents in 2023-2024 builds. Use the cable that ships with a quality PSU and check seating monthly.
Skip this tier if: you specifically need more than 24 GB on a single card. The 4090 has the same VRAM as the 3090, just faster. Move to Tier 4 (Apple) or Tier 5 (multi-GPU).
Tier 4 — $2500-3500, Apple M3 Max 64-128 GB
Hardware: a Mac Studio M2/M3 Max with 64-128 GB unified memory ($2,800-4,000 new, $2,200-3,200 used), or a MacBook Pro M3 Max 64 GB ($3,200-4,000) if you want the laptop form factor.
What runs comfortably: 70B Q4 with comfortable context. Multiple models loaded simultaneously. The unified memory is what makes this work — there's no VRAM/RAM split, so a 64 GB Mac runs models that no 24 GB NVIDIA card can hold. Memory bandwidth (M3 Max: ~400 GB/s) is well below high-end NVIDIA cards, but on models too large for 24 GB of VRAM the Mac comes out ahead simply because nothing spills to system RAM.
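The fit-and-speed arithmetic, with the same rough assumptions as earlier:

```python
weights_gb = 70 * 4.5 / 8   # ~39 GB for 70B at Q4: impossible on a 24 GB card,
print(weights_gb)           #   comfortable inside 64 GB of unified memory
print(400 / weights_gb)     # ~10 tok/s bandwidth ceiling at ~400 GB/s;
                            #   real-world reports land somewhat below that
```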
What it feels like: the quietest, lowest-power, most-set-and-forget local AI experience available at this price. Idle power is single-digit watts; full inference is 30-60 W. No PSU sizing, no thermal-pad replacement, no driver wars.
Operator gotcha: the software ecosystem is narrower than CUDA. Ollama and llama.cpp Metal back-end work excellently. MLX-LM is best-in-class on Apple. But vLLM, ExLlamaV2, TensorRT-LLM don't run. Image-generation tooling (ComfyUI, A1111) lags NVIDIA significantly. If you want broad ecosystem, NVIDIA is correct; if you want low-noise high-VRAM-equivalent, Apple is correct.
Skip this tier if: you need fine-tuning, multi-GPU, or the broadest tool ecosystem. Move to Tier 5.
Tier 5 — $4000+, dual 3090 or M3 Ultra
Hardware (NVIDIA path): dual used RTX 3090 24 GB ($1,400-1,800 for the pair) + workstation motherboard with 2x PCIe x16 slots + 1200 W+ Gold PSU + larger case ($800-1,200). Total for the full system: $3,500-5,000.
Hardware (Apple path): Mac Studio M3 Ultra with 192-512 GB unified memory ($5,500-9,500). The 192 GB tier is the practical entry; the 512 GB tier exists for users who want to run 671B-class MoE models at home.
What runs comfortably: dual 3090s run 70B Q4 fast with room for context; 70B Q5 fits, but only with tight context. Tensor-parallel serving (where NVLink helps) pushes throughput higher. 100B-class MoE in 4-bit is approachable. Fine-tuning becomes practical. The Apple Ultra path runs 70B at FP16, 120B+ in 4-bit, and the 512 GB tier runs DeepSeek V3 671B-A37B in 4-bit at home.
What it feels like: serious tinkerer or production-prosumer territory. The dual 3090 build is loud, hot, and rewards operator skill. The M3 Ultra build is silent, expensive, and just works.
Operator gotcha (NVIDIA): expecting NVLink to pool VRAM. NVLink accelerates communication between cards but does NOT make them appear as a single GPU to the runtime — you still need software that supports tensor-parallelism (vLLM, ExLlamaV2 with TabbyAPI, llama.cpp's split-mode). Read more in running local AI on multiple GPUs.
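A minimal sketch of asking for tensor parallelism explicitly in vLLM, assuming two visible GPUs; the model id is a placeholder.

```python
from vllm import LLM

# Shard each layer's weights across both 3090s; NVLink speeds up the
# all-reduce traffic between them but is not required.
llm = LLM(
    model="your-org/your-70b-awq-model",  # hypothetical 4-bit 70B checkpoint
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)
```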
Operator gotcha (Apple): the M3 Ultra at 512 GB is a $9,500 commitment for a use case (671B MoE inference at home) that fewer than 1% of users actually need. Most people get more value from the 192 GB tier.
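For scale, the weight-size arithmetic behind that 512 GB use case, assuming roughly 4.5 bits per weight:

```python
print(671 * 4.5 / 8)   # ~377 GB of weights for a 671B-parameter model in 4-bit:
                       # too big for the 192 GB tier, fits in 512 GB with room for KV cache
```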
AMD vs NVIDIA: when AMD wins
AMD's RX 7900 XTX 24 GB at $850-950 is genuinely competitive on Linux. ROCm in 2026 supports llama.cpp, vLLM, and ExLlamaV2 reliably. The bandwidth (960 GB/s) is excellent. The price is roughly in line with a used 3090 and $400-700 below a used 4090 for the same 24 GB of VRAM.
AMD wins when: you're on Linux, you only do inference (not fine-tuning), and you don't need image generation. AMD loses on Windows (ROCm Windows support has improved but still trails Linux substantially), on training (ROCm fine-tuning is rougher), and on the breadth of tooling.
The mistakes operators make at every tier
- Buying for compute over VRAM. The number that decides what you can run is VRAM. Always optimize VRAM first.
- Buying for “future models.” Buy for what you'll run this month. The field moves fast enough that “future-proofing” rarely pays off.
- Skipping PSU sizing. AI inference produces transient spikes that gaming benchmarks don't. Size up.
- Ignoring case airflow. A 3090 in a tight case throttles within 30 minutes of sustained inference. Verify case airflow before buying.
- Forgetting electricity cost. A 350 W card running 4 hours a day at $0.15/kWh is about $77/year (the arithmetic is sketched below). Not crippling, but real.
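The electricity figure from the last bullet, worked out:

```python
watts, hours_per_day, price_per_kwh = 350, 4, 0.15
print(watts / 1000 * hours_per_day * 365 * price_per_kwh)   # ~$76.65 per year
```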
How to commit
Three concrete next steps depending on which tier you're landing in:
- Tier 0 / 1: run /will-it-run/custom with your current specs first. You may already be at Tier 1 capability.
- Tier 2 / 3: read the RTX 3090 and RTX 4090 hardware pages for the full spec breakdown, then buy used through eBay or local listings.
- Tier 4 / 5: read /hardware/apple-m3-ultra and dual 3090 vs single 5090 before committing the larger budget. The trade-offs are real.
Adjacent reading: Can I run AI locally on my computer? for the foundation, /systems/quantization-formats for what fits how on each tier, and the full /hardware directory for everything else.