Choosing a GPU for local AI in 2026
Which GPU should you actually buy for running LLMs locally? Honest tier breakdown across consumer NVIDIA, AMD, Apple Silicon, and the used market. No hedging, no listicle filler — we tell you what we'd buy at each price point and why.
The bottom line up front
Most people should buy: a used RTX 3090 ($700-900) or new RTX 5070 Ti / 4060 Ti 16GB depending on budget.
Don't buy: any new 8 GB card for AI work, or the RTX 5090 unless you specifically need 32 GB.
The single number that matters most: VRAM capacity. The number that matters second-most: memory bandwidth.
Apple Silicon plot twist: an M4 Max with 64-128 GB of unified memory is the simplest path to running 70B-class models locally without multi-card setups.
The two numbers that decide everything
Most GPU comparisons get distracted by gaming benchmarks, ray-tracing performance, or the MSRP. For local AI, those numbers are mostly irrelevant. The two numbers that actually determine whether a card is good for running LLMs are VRAM capacity and memory bandwidth, in that order.
VRAM is the gate. If a model doesn't fit, it doesn't run — full stop. You can spill into system RAM via CPU offload, but you'll go from 40 tokens/sec to 2-3 tokens/sec. That's the difference between "this is a useful tool" and "I'd rather use ChatGPT." So the first question is always: how much VRAM does the card have?
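As a rough sanity check before you buy, you can estimate whether a model fits. The sketch below is a back-of-the-envelope heuristic, not a prediction from any particular runtime; the per-token KV-cache figure and the fixed overhead are assumptions that vary by model architecture and software stack.

```python
# Back-of-the-envelope check: does a quantized model fit in a given amount of VRAM?
# The constants are rough rules of thumb, not exact figures for any specific runtime.

def fits_in_vram(params_billion: float, bits_per_weight: float,
                 context_tokens: int, vram_gb: float) -> bool:
    weights_gb = params_billion * bits_per_weight / 8   # e.g. 8B at ~4.5 bits is about 4.5 GB
    kv_cache_gb = context_tokens * 0.000125             # ~0.13 MB/token for a GQA 7-14B model at FP16
    overhead_gb = 1.5                                    # runtime buffers, CUDA context, your desktop
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb

print(fits_in_vram(8, 4.5, 8192, 8))    # 8B, Q4, 8K context on an 8 GB card: barely
print(fits_in_vram(14, 4.5, 8192, 12))  # 14B, Q4, 8K context on a 12 GB card: yes
print(fits_in_vram(32, 4.5, 8192, 24))  # 32B, Q4 on a 24 GB card: yes, with room to spare
```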
Memory bandwidth is the speed limit. Once a model fits in VRAM, the speed at which it generates tokens is almost entirely bound by how fast the GPU can read its own weights from memory. Tokens-per-second scales roughly linearly with bandwidth. The RTX 5090 has 1,792 GB/s; the RTX 4060 has about 272 GB/s. That's more than a 6× gap in real inference speed, even on a model that fits comfortably on both.
Compute (TFLOPS) matters for prefill — the time before the first token appears. But for sustained generation, it's the bandwidth that determines whether you're waiting or working.
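You can turn that into a rough ceiling on generation speed: each new token requires reading roughly the full set of weights once, so tokens-per-second is approximately bandwidth divided by model size, times an efficiency factor. The sketch below is illustrative only; the 0.6 efficiency factor is an assumption, and real numbers depend on the runtime, quantization kernels, and context length.

```python
# Rough decode-speed ceiling: tokens/sec ~= memory bandwidth / bytes read per token.
# For a dense model, bytes read per token is roughly the quantized model size.
def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 0.6) -> float:
    return bandwidth_gb_s * efficiency / model_size_gb

model_gb = 8.5  # a 14B model at ~4.5 bits/weight
for name, bw in [("RTX 4060 Ti 16GB", 288), ("M4 Max", 546),
                 ("RTX 3090", 936), ("RTX 5090", 1792)]:
    print(f"{name}: ~{est_tokens_per_sec(bw, model_gb):.0f} tok/s on a {model_gb} GB model")
```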
Tier breakdown
Below are six tiers based on VRAM capacity and price-per-VRAM-GB. Each tier corresponds to a different class of model you can comfortably run. Prices are 2026 mid-year street; expect drift.
Entry tier — 8 GB cards
Examples: RTX 5060 ($299), RTX 4060 ($269), RTX 3060 8GB ($249), RX 7600 XT 16GB ($309 — better pick if you're shopping at this price).
What runs: 7B models in 4-bit quantization with reduced context. You can chat with Llama 3.1 8B or Qwen 3 8B reasonably well. You cannot run 14B-class models comfortably.
Our verdict: Skip the 8 GB tier entirely if AI is your primary use case. The performance gap from 8 GB to 12 GB is much larger than the price gap, and 8 GB cards cap your future as the field moves toward larger models. The single exception is the RX 7600 XT — at $309 it's the cheapest 16 GB card on the new market, which is a different conversation entirely.
Mid tier — 12 GB cards
Examples: RTX 3060 12GB ($249 new, $200 used — the value floor), RTX 4070 Super ($619), RTX 5070 ($599).
What runs: 14B models in 4-bit with usable context. 7B models with full context and high-quality quantization. Borderline 32B with severe quant compression.
Our verdict: The RTX 3060 12GB is the cheapest serious entry into local AI and remains a coherent buy in 2026 even though it's a 2021 card. The RTX 4070 Super is the mid-range pick if you're buying new — it has 1.5× the bandwidth of the 3060 and runs 14B models meaningfully faster. The RTX 5070 has the same VRAM as both with newer architecture but minimal real-world advantage; only buy at MSRP, never above.
High mid tier — 16 GB cards (the sweet spot)
Examples: RTX 4060 Ti 16GB ($449), RTX 5060 Ti 16GB ($459), RX 9070 XT ($649), RX 7800 XT ($459), RTX 4080 Super ($1,099), RTX 5080 ($1,199).
What runs: 14B comfortably with full context. 32B models in 4-bit with reasonable context. 70B-class models become possible with aggressive offload (slow).
Our verdict: This tier is where the genuine "local AI is fun" experience starts. The RTX 4060 Ti 16GB at $450 is the value pick: its memory bandwidth is mediocre, but the VRAM-per-dollar is unmatched on the new market. If you have $1,200, the RTX 5080 gives you the same 16 GB with 2-3× the bandwidth and 2026-current architecture. The RX 9070 XT is the AMD pick if you're on Linux and want to skip the NVIDIA tax: the same 16 GB, somewhat lower bandwidth, and about $550 less than the 5080.
Don't get confused by: the RTX 4060 Ti 8GB. Despite the same name, it's a materially different card with half the VRAM. Always verify "16GB" is in the listing.
Enthusiast tier — 24 GB cards
Examples: RTX 3090 ($899 used, the champion of price-per-VRAM-GB), RTX 3090 Ti ($1,199 used), RTX 4090 ($1,899 new), RX 7900 XTX ($899 new).
What runs: 32B models comfortably. 70B in 4-bit with reasonable context. MoE models like Qwen 3 30B-A3B run beautifully here.
Our verdict: If we had to recommend a single card for serious local AI in 2026, it would be a used RTX 3090 at $700-900. You get the same 24 GB as the RTX 4090 at well under half the price, with roughly 25 percent slower inference. The bandwidth (936 GB/s) is excellent. The community has stress-tested this card for more than five years.
The RTX 4090 is faster but the price premium over the 3090 is hard to justify unless you specifically need the speed. The RX 7900 XTX is the sleeper pick on Linux — same 24 GB, comparable bandwidth, $1,000 cheaper than a 4090. ROCm support is now solid for inference (training is still rougher).
Top tier — 32 GB single card
Example: RTX 5090 ($2,499 street).
What runs: 70B comfortably with context to spare. 100B-class MoE in 4-bit. 120B+ becomes possible with aggressive offload. Practical multi-model workflows where you keep a primary 32B + a draft model loaded simultaneously.
Our verdict: The RTX 5090 is the only consumer card that crosses 24 GB, and its $2,500 price reflects that exclusivity tax. If you specifically need 32 GB on a single card (for very long contexts, multi-model setups, or 70B+ models without offload tricks), there's no competitor. Otherwise, two used RTX 3090s for the same price give you 48 GB combined via tensor parallelism, with the catch that not every workflow supports multi-card.
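If you do go the dual-3090 route, tensor parallelism is the usual way to treat the pair as one pool. Here is a minimal sketch using vLLM, assuming it's installed, both cards are visible to CUDA, and you pick a quantized 70B-class model that actually fits in 48 GB with modest context; the model name below is illustrative, not a specific recommendation.

```python
# Minimal sketch: splitting one model across two GPUs with vLLM's tensor parallelism.
# Assumes vLLM is installed and both cards are visible to CUDA; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # a quantized 70B-class model (example only)
    tensor_parallel_size=2,                  # shard the weights across both 3090s
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Explain tensor parallelism in one paragraph."],
                   SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
```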
Workstation tier — 48-96 GB
Examples: RTX PRO 6000 Blackwell ($8,499, 96 GB), RTX 6000 Ada ($6,499, 48 GB), RTX A6000 ($3,500 used, 48 GB).
What runs: 70B at 8-bit with room for long context. 120B-class in 4-bit. The largest MoE models (Qwen 3 235B-A22B) become approachable.
Our verdict: The used RTX A6000 is genuinely interesting for serious tinkerers — 48 GB at $3,500 used puts you in workstation territory at half the new price. The RTX PRO 6000 Blackwell at $8,500 is for production users who need 96 GB and full warranty support; for hobbyists it's hard to justify over multi-3090 setups.
Apple Silicon — the unified-memory plot twist
Apple's M-series chips are unique in that the GPU and CPU share the same memory pool. There is no PCIe transfer cost, no fragmentation between VRAM and system RAM. A Mac Studio with M3 Ultra and up to 512 GB of unified memory can load models that no consumer NVIDIA card can hold.
The relevant Apple chips for local AI:
- M4 Max with 64-128 GB. 546 GB/s memory bandwidth. Runs 70B Q4 comfortably. The premium AI laptop pick.
- M3 Ultra with 96-512 GB. 819 GB/s bandwidth. Mac Studio configurations top out at 512 GB unified — enough to run 671B-class MoE models in 4-bit.
- M4 Pro with 24-48 GB. 273 GB/s. Solid for 14-32B class.
The catch is the software ecosystem. MLX (Apple's native ML framework) performs excellently — sometimes 70-90 percent of CUDA throughput on equivalent hardware — but the breadth of supported tools and quantization formats is narrower. llama.cpp on Metal is mature; vLLM and ExLlamaV2 don't run on Apple. Image-generation tooling (ComfyUI, Stable Diffusion WebUI) lags significantly behind NVIDIA.
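For LLM work specifically, the mature path on a Mac is llama.cpp's Metal backend, for example through its Python bindings. A minimal sketch, assuming llama-cpp-python is installed with Metal support and you've already downloaded a GGUF file; the model path is illustrative.

```python
# Minimal sketch of running a GGUF model on Apple Silicon via llama-cpp-python,
# which uses llama.cpp's Metal backend. The model path is illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",  # any local GGUF file
    n_gpu_layers=-1,   # offload all layers to the Metal GPU
    n_ctx=8192,
)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize why unified memory helps local LLMs."}],
    max_tokens=200,
)
print(resp["choices"][0]["message"]["content"])
```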
Our verdict: If you primarily run LLMs (not image gen) and you value quiet, low-power hardware, the M4 Max with 64 GB is the simplest path to "I can run any consumer-class model" without dealing with discrete GPUs. The configurations to look for: an M4 Max (full 40-core GPU) with 64 GB ($3,999) or an M3 Ultra Mac Studio with 256 GB ($5,799+).
The used market math
The price-per-VRAM-GB on the used market in 2026 is dominated by NVIDIA's 30-series. A used RTX 3090 24GB at $850 works out to $35/GB — versus the RTX 5090's $78/GB at $2,500. That's more than 2× the value, with the trade being about 25 percent slower inference and older architecture.
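If you want to run the same comparison on other listings, the arithmetic is trivial; the sketch below uses the street prices quoted in this guide, so swap in whatever the listing in front of you actually says.

```python
# Price-per-GB of VRAM, the quick sanity check used throughout this guide.
# Prices are the mid-2026 street figures quoted above; adjust to real listings.
cards = [
    ("RTX 3060 12GB (used)", 225, 12),
    ("RTX 4060 Ti 16GB (new)", 449, 16),
    ("RTX 3090 (used)", 850, 24),
    ("RTX 5090 (new)", 2500, 32),
    ("RTX A6000 (used)", 3500, 48),
]
for name, price, vram in sorted(cards, key=lambda c: c[1] / c[2]):
    print(f"{name}: ${price / vram:.0f}/GB")
```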
The used cards we still actively recommend in 2026:
- RTX 3090 at $700-900 — the price-per-GB champion
- RTX 3090 Ti at $1,000-1,200 — slightly faster, same VRAM
- RTX 3060 12GB at $200-250 — entry-level value floor
- RTX A6000 at $3,000-3,800 — workstation tier without retail markup
We do not recommend used RTX 4090s — their used prices haven't dropped enough below new to justify the warranty risk. We do not recommend used 8 GB cards of any era for AI use.
The mistakes to avoid
- Buying for compute over VRAM. The RTX 5060 has more compute than an RTX 3060 12GB, but the 3060 runs more models because of the 4 GB VRAM advantage. Always optimize for VRAM first.
- Assuming "more recent" means "better for AI." Generation matters less than VRAM and bandwidth. A used RTX 3090 outperforms a new RTX 5070 for AI specifically.
- Trusting benchmarks that ignore memory bandwidth. Gaming benchmarks measure compute; LLM generation speed tracks memory bandwidth. The two tell different stories.
- Going AMD on Windows for AI. ROCm Windows support has improved but still trails Linux substantially. If you must run Windows, NVIDIA still wins for AI.
- Skipping cooling considerations on laptops. Laptop GPUs throttle under sustained inference load. An "RTX 4090 Mobile 16GB" laptop with bad thermals will underperform an RTX 4070 Ti Super desktop in real use. Read sustained-load reviews.
- Buying for what you might run someday. Buy for the model you will run this month. The field moves fast enough that hardware bought "for future 200B models" will be eclipsed by something better before those models matter to you.
The buy-or-wait calculation in 2026
NVIDIA's RTX 60-series is rumored for late 2027. AMD's RDNA 5 (RX 10000-series) is rumored for the same window. Apple's M5 family is expected late 2026.
If you're shopping in mid-2026, you have a clear window: the RTX 50 series and RX 9070 line are readily available at or near MSRP, the used 30-series is plentiful, and Apple's M4 Max is shipping. If you wait until late 2027 hoping for the next generation, you'll sit through six months of supply constraints and pay launch premiums when it ships. Buy now for the model size you actually want to run, with the assumption that you'll repeat this calculation in 18 months.
Where to go from here
- Check exactly what your hardware can run with our compatibility tool: Will it run?
- Browse the full hardware database with bandwidth specs and benchmarks: Hardware directory
- See benchmarks tagged by source (owner-run, community, vendor): Benchmarks
- Read our methodology page for how we calculate tokens-per-second predictions
Frequently asked
Is the RTX 5090 worth $2,500 for local AI?
Only if you specifically need 32 GB on a single card for very long contexts, multi-model setups, or 70B+ models without offload. Otherwise a used RTX 3090, or two of them, is far better value per GB.
Can I just buy an RTX 4060 Ti 16GB for local AI?
Yes. At $449 it's the VRAM-per-dollar pick on the new market: 14B models run comfortably and 32B models fit in 4-bit, though its modest memory bandwidth means generation is slower than on higher-tier cards.
Should I get an Apple M-series chip instead of a discrete GPU?
If you primarily run LLMs rather than image generation and you value quiet, low-power hardware, yes; an M4 Max with 64 GB or more handles 70B-class models without any multi-card setup. The tooling ecosystem is narrower than NVIDIA's.
Is AMD finally usable for local AI in 2026?
On Linux, yes: ROCm is now solid for inference, and cards like the RX 7900 XTX and RX 9070 XT are strong value. On Windows, ROCm still trails, so NVIDIA remains the safer pick.
How much VRAM do I really need?
Match it to the models you want to run: 12 GB for 14B-class, 16 GB for 32B in 4-bit, 24 GB for 70B in 4-bit, and 32 GB or more for 70B+ with long context or multi-model workflows.
This guide is hand-curated by RunLocalAI Editorial. It is reviewed at least quarterly and on each major hardware launch. If a number on this page is wrong or outdated, email corrections@runlocalai.co and we'll update within a week.
See our editorial policy for how we research and verify hardware claims, and our how we make money page for affiliate-link disclosures.