Choosing a GPU for local AI in 2026
Which GPU should you actually buy for running LLMs locally? Honest tier breakdown across consumer NVIDIA, AMD, Apple Silicon, and the used market. No hedging, no listicle filler — we tell you what we'd buy at each price point and why.
The bottom line up front
Most people should buy: a used RTX 3090 ($700-900) or new RTX 5070 Ti / 4060 Ti 16GB depending on budget.
Don't buy: any new 8 GB card for AI work, or the RTX 5090 unless you specifically need 32 GB.
The single number that matters most: VRAM capacity. The number that matters second-most: memory bandwidth.
Apple Silicon plot twist: an M4 Max with 64-128 GB of unified memory is the simplest path to running 70B-class models locally without multi-card setups.
The two numbers that decide everything
Most GPU comparisons get distracted by gaming benchmarks, ray-tracing performance, or the MSRP. For local AI, those numbers are mostly irrelevant. The two numbers that actually determine whether a card is good for running LLMs are VRAM capacity and memory bandwidth, in that order.
VRAM is the gate. If a model doesn't fit, it doesn't run — full stop. You can spill into system RAM via CPU offload, but you'll go from 40 tokens/sec to 2-3 tokens/sec. That's the difference between "this is a useful tool" and "I'd rather use ChatGPT." So the first question is always: how much VRAM does the card have?
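As a rough sanity check before you buy, you can estimate whether a model fits. The sketch below is a back-of-the-envelope heuristic, not a prediction from any particular runtime; the per-token KV-cache figure and the fixed overhead are assumptions that vary by model architecture and software stack.

```python
# Back-of-the-envelope check: does a quantized model fit in a given amount of VRAM?
# The constants are rough rules of thumb, not exact figures for any specific runtime.

def fits_in_vram(params_billion: float, bits_per_weight: float,
                 context_tokens: int, vram_gb: float) -> bool:
    weights_gb = params_billion * bits_per_weight / 8   # e.g. 8B at ~4.5 bits is about 4.5 GB
    kv_cache_gb = context_tokens * 0.000125             # ~0.13 MB/token for a GQA 7-14B model at FP16
    overhead_gb = 1.5                                    # runtime buffers, CUDA context, your desktop
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb

print(fits_in_vram(8, 4.5, 8192, 8))    # 8B, Q4, 8K context on an 8 GB card: barely
print(fits_in_vram(14, 4.5, 8192, 12))  # 14B, Q4, 8K context on a 12 GB card: yes
print(fits_in_vram(32, 4.5, 8192, 24))  # 32B, Q4 on a 24 GB card: yes, with room to spare
```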
Memory bandwidth is the speed limit. Once a model fits in VRAM, the speed at which it generates tokens is almost entirely bound by how fast the GPU can read its own weights from memory. Tokens-per-second scales roughly linearly with bandwidth. The RTX 5090 has 1,792 GB/s; the RTX 4060 has about 272 GB/s. That's more than a 6× gap in real inference speed, even on a model that fits comfortably on both.
Compute (TFLOPS) matters for prefill — the time before the first token appears. But for sustained generation, it's the bandwidth that determines whether you're waiting or working.
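You can turn that into a rough ceiling on generation speed: each new token requires reading roughly the full set of weights once, so tokens-per-second is approximately bandwidth divided by model size, times an efficiency factor. The sketch below is illustrative only; the 0.6 efficiency factor is an assumption, and real numbers depend on the runtime, quantization kernels, and context length.

```python
# Rough decode-speed ceiling: tokens/sec ~= memory bandwidth / bytes read per token.
# For a dense model, bytes read per token is roughly the quantized model size.
def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 0.6) -> float:
    return bandwidth_gb_s * efficiency / model_size_gb

model_gb = 8.5  # a 14B model at ~4.5 bits/weight
for name, bw in [("RTX 4060 Ti 16GB", 288), ("M4 Max", 546),
                 ("RTX 3090", 936), ("RTX 5090", 1792)]:
    print(f"{name}: ~{est_tokens_per_sec(bw, model_gb):.0f} tok/s on a {model_gb} GB model")
```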
Tier breakdown
Below are six tiers based on VRAM capacity and price-per-VRAM-GB. Each tier corresponds to a different class of model you can comfortably run. Prices are 2026 mid-year street; expect drift.
Entry tier — 8 GB cards
Examples: RTX 5060 ($299), RTX 4060 ($269), RTX 3060 8GB ($249), RX 7600 XT 16GB ($309 — better pick if you're shopping at this price).
What runs: 7B models in 4-bit quantization with reduced context. You can chat with Llama 3.1 8B or Qwen 3 8B reasonably well. You cannot run 14B-class models comfortably.
Our verdict: Skip the 8 GB tier entirely if AI is your primary use case. The performance gap from 8 GB to 12 GB is much larger than the price gap, and 8 GB cards cap your future as the field moves toward larger models. The single exception is the RX 7600 XT — at $309 it's the cheapest 16 GB card on the new market, which is a different conversation entirely.
Mid tier — 12 GB cards
Examples: RTX 3060 12GB ($249 new, $200 used — the value floor), RTX 4070 Super ($619), RTX 5070 ($599).
What runs: 14B models in 4-bit with usable context. 7B models with full context and high-quality quantization. Borderline 32B with severe quant compression.
Our verdict: The RTX 3060 12GB is the cheapest serious entry into local AI and remains a coherent buy in 2026 even though it's a 2021 card. The RTX 4070 Super is the mid-range pick if you're buying new — it has 1.5× the bandwidth of the 3060 and runs 14B models meaningfully faster. The RTX 5070 has the same VRAM as both with newer architecture but minimal real-world advantage; only buy at MSRP, never above.
High mid tier — 16 GB cards (the sweet spot)
Examples: RTX 4060 Ti 16GB ($449), RTX 5060 Ti 16GB ($459), RX 9070 XT ($649), RX 7800 XT ($459), RTX 4080 Super ($1,099), RTX 5080 ($1,199).
What runs: 14B comfortably with full context. 32B models in 4-bit with reasonable context. 70B-class models become possible with aggressive offload (slow).
Our verdict: This tier is where the genuine "local AI is fun" experience starts. The RTX 4060 Ti 16GB at $450 is the value pick: its memory bandwidth is mediocre, but the VRAM-per-dollar is unmatched on the new market. If you have $1,200, the RTX 5080 gives you the same 16 GB with 2-3× the bandwidth and 2026-current architecture. The RX 9070 XT is the AMD pick if you're on Linux and want to skip the NVIDIA tax: the same 16 GB, somewhat lower bandwidth, and about $550 less than the 5080.
Don't get confused by: the RTX 4060 Ti 8GB. Despite the same name, it's a materially different card with half the VRAM. Always verify "16GB" is in the listing.
Enthusiast tier — 24 GB cards
Examples: RTX 3090 ($899 used, the champion of price-per-VRAM-GB), RTX 3090 Ti ($1,199 used), RTX 4090 ($1,899 new), RX 7900 XTX ($899 new).
What runs: 32B models comfortably. 70B in 4-bit with reasonable context. MoE models like Qwen 3 30B-A3B run beautifully here.
Our verdict: If we had to recommend a single card for serious local AI in 2026, it would be a used RTX 3090 at $700-900. You get the same 24 GB as the RTX 4090 at well under half the price, with roughly 25 percent slower inference. The bandwidth (936 GB/s) is excellent. The community has stress-tested this card for more than five years.
The RTX 4090 is faster but the price premium over the 3090 is hard to justify unless you specifically need the speed. The RX 7900 XTX is the sleeper pick on Linux — same 24 GB, comparable bandwidth, $1,000 cheaper than a 4090. ROCm support is now solid for inference (training is still rougher).
Top tier — 32 GB single card
Example: RTX 5090 ($2,499 street).
What runs: 70B comfortably with context to spare. 100B-class MoE in 4-bit. 120B+ becomes possible with aggressive offload. Practical multi-model workflows where you keep a primary 32B + a draft model loaded simultaneously.
Our verdict: The RTX 5090 is the only consumer card that crosses 24 GB, and its $2,500 price reflects that exclusivity tax. If you specifically need 32 GB on a single card (for very long contexts, multi-model setups, or 70B+ models without offload tricks), there's no competitor. Otherwise, two used RTX 3090s for the same price give you 48 GB combined via tensor parallelism, with the catch that not every workflow supports multi-card.
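If you do go the dual-3090 route, tensor parallelism is the usual way to treat the pair as one pool. Here is a minimal sketch using vLLM, assuming it's installed, both cards are visible to CUDA, and you pick a quantized 70B-class model that actually fits in 48 GB with modest context; the model name below is illustrative, not a specific recommendation.

```python
# Minimal sketch: splitting one model across two GPUs with vLLM's tensor parallelism.
# Assumes vLLM is installed and both cards are visible to CUDA; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # a quantized 70B-class model (example only)
    tensor_parallel_size=2,                  # shard the weights across both 3090s
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Explain tensor parallelism in one paragraph."],
                   SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
```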
Workstation tier — 48-96 GB
Examples: RTX PRO 6000 Blackwell ($8,499, 96 GB), RTX 6000 Ada ($6,499, 48 GB), RTX A6000 ($3,500 used, 48 GB).
What runs: 70B at 8-bit with room for long context. 120B-class in 4-bit. The largest MoE models (Qwen 3 235B-A22B) become approachable.
Our verdict: The used RTX A6000 is genuinely interesting for serious tinkerers — 48 GB at $3,500 used puts you in workstation territory at half the new price. The RTX PRO 6000 Blackwell at $8,500 is for production users who need 96 GB and full warranty support; for hobbyists it's hard to justify over multi-3090 setups.
Apple Silicon — the unified-memory plot twist
Apple's M-series chips are unique in that the GPU and CPU share the same memory pool. There is no PCIe transfer cost, no fragmentation between VRAM and system RAM. A Mac Studio with M3 Ultra and up to 512 GB of unified memory can load models that no consumer NVIDIA card can hold.
The relevant Apple chips for local AI:
- M4 Max with 64-128 GB. 546 GB/s memory bandwidth. Runs 70B Q4 comfortably. The premium AI laptop pick.
- M3 Ultra with 96-512 GB. 819 GB/s bandwidth. Mac Studio configurations top out at 512 GB unified — enough to run 671B-class MoE models in 4-bit.
- M4 Pro with 24-48 GB. 273 GB/s. Solid for 14-32B class.
The catch is the software ecosystem. MLX (Apple's native ML framework) performs excellently — sometimes 70-90 percent of CUDA throughput on equivalent hardware — but the breadth of supported tools and quantization formats is narrower. llama.cpp on Metal is mature; vLLM and ExLlamaV2 don't run on Apple. Image-generation tooling (ComfyUI, Stable Diffusion WebUI) lags significantly behind NVIDIA.
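For LLM work specifically, the mature path on a Mac is llama.cpp's Metal backend, for example through its Python bindings. A minimal sketch, assuming llama-cpp-python is installed with Metal support and you've already downloaded a GGUF file; the model path is illustrative.

```python
# Minimal sketch of running a GGUF model on Apple Silicon via llama-cpp-python,
# which uses llama.cpp's Metal backend. The model path is illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",  # any local GGUF file
    n_gpu_layers=-1,   # offload all layers to the Metal GPU
    n_ctx=8192,
)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize why unified memory helps local LLMs."}],
    max_tokens=200,
)
print(resp["choices"][0]["message"]["content"])
```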
Our verdict: If you primarily run LLMs (not image gen) and you value quiet, low-power hardware, the M4 Max with 64 GB is the simplest path to "I can run any consumer-class model" without dealing with discrete GPUs. The configurations to look for: an M4 Max (full 40-core GPU) with 64 GB ($3,999) or an M3 Ultra Mac Studio with 256 GB ($5,799+).
The used market math
The price-per-VRAM-GB on the used market in 2026 is dominated by NVIDIA's 30-series. A used RTX 3090 24GB at $850 works out to $35/GB — versus the RTX 5090's $78/GB at $2,500. That's more than 2× the value, with the trade being about 25 percent slower inference and older architecture.
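If you want to run the same comparison on other listings, the arithmetic is trivial; the sketch below uses the street prices quoted in this guide, so swap in whatever the listing in front of you actually says.

```python
# Price-per-GB of VRAM, the quick sanity check used throughout this guide.
# Prices are the mid-2026 street figures quoted above; adjust to real listings.
cards = [
    ("RTX 3060 12GB (used)", 225, 12),
    ("RTX 4060 Ti 16GB (new)", 449, 16),
    ("RTX 3090 (used)", 850, 24),
    ("RTX 5090 (new)", 2500, 32),
    ("RTX A6000 (used)", 3500, 48),
]
for name, price, vram in sorted(cards, key=lambda c: c[1] / c[2]):
    print(f"{name}: ${price / vram:.0f}/GB")
```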
The used cards we still actively recommend in 2026:
- RTX 3090 at $700-900 — the price-per-GB champion
- RTX 3090 Ti at $1,000-1,200 — slightly faster, same VRAM
- RTX 3060 12GB at $200-250 — entry-level value floor
- RTX A6000 at $3,000-3,800 — workstation tier without retail markup
We do not recommend used RTX 4090s — their used prices haven't dropped enough below new to justify the warranty risk. We do not recommend used 8 GB cards of any era for AI use.
The mistakes to avoid
- Buying for compute over VRAM. The RTX 5060 has more compute than an RTX 3060 12GB, but the 3060 runs more models because of the 4 GB VRAM advantage. Always optimize for VRAM first.
- Assuming "more recent" means "better for AI." Generation matters less than VRAM and bandwidth. A used RTX 3090 outperforms a new RTX 5070 for AI specifically.
- Trusting benchmarks that ignore memory bandwidth. Gaming benchmarks measure compute; LLM generation speed tracks memory bandwidth. The two tell different stories.
- Going AMD on Windows for AI. ROCm Windows support has improved but still trails Linux substantially. If you must run Windows, NVIDIA still wins for AI.
- Skipping cooling considerations on laptops. Laptop GPUs throttle under sustained inference load. An "RTX 4090 Mobile 16GB" laptop with bad thermals will underperform an RTX 4070 Ti Super desktop in real use. Read sustained-load reviews.
- Buying for what you might run someday. Buy for the model you will run this month. The field moves fast enough that hardware bought "for future 200B models" will be eclipsed by something better before those models matter to you.
The buy-or-wait calculation in 2026
NVIDIA's RTX 60-series is rumored for late 2027. AMD's RDNA 5 (RX 10000-series) is rumored for the same window. Apple's M5 family is expected late 2026.
If you're shopping in mid-2026, you have a clear window: the RTX 50 series and RX 9070 line are readily available at or near MSRP, the used 30-series is plentiful, and Apple's M4 Max is shipping. If you wait until late 2027 hoping for the next generation, you'll sit through six months of supply constraints and pay launch premiums when it ships. Buy now for the model size you actually want to run, with the assumption that you'll repeat this calculation in 18 months.
Where to go from here
- Check exactly what your hardware can run with our compatibility tool: Will it run?
- Browse the full hardware database with bandwidth specs and benchmarks: Hardware directory
- See benchmarks tagged by source (owner-run, community, vendor): Benchmarks
- Read our methodology page for how we calculate tokens-per-second predictions
Frequently asked
Is the RTX 5090 worth $2,500 for local AI?
Only if you specifically need 32 GB on a single card for very long contexts, multi-model setups, or 70B+ models without offload. Otherwise a used RTX 3090, or two of them, is far better value per GB.
Can I just buy an RTX 4060 Ti 16GB for local AI?
Yes. At $449 it's the VRAM-per-dollar pick on the new market: 14B models run comfortably and 32B models fit in 4-bit, though its modest memory bandwidth means generation is slower than on higher-tier cards.
Should I get an Apple M-series chip instead of a discrete GPU?
If you primarily run LLMs rather than image generation and you value quiet, low-power hardware, yes; an M4 Max with 64 GB or more handles 70B-class models without any multi-card setup. The tooling ecosystem is narrower than NVIDIA's.
Is AMD finally usable for local AI in 2026?
On Linux, yes: ROCm is now solid for inference, and cards like the RX 7900 XTX and RX 9070 XT are strong value. On Windows, ROCm still trails, so NVIDIA remains the safer pick.
How much VRAM do I really need?
Match it to the models you want to run: 12 GB for 14B-class, 16 GB for 32B in 4-bit, 24 GB for 70B in 4-bit, and 32 GB or more for 70B+ with long context or multi-model workflows.
This guide is hand-curated by RunLocalAI Editorial. It is reviewed at least quarterly and on each major hardware launch. If a number on this page is wrong or outdated, email corrections@runlocalai.co and we'll update within a week.
See our editorial policy for how we research and verify hardware claims, and our how we make money page for affiliate-link disclosures.