NVIDIA GeForce RTX 4090
The community-default high-end local-AI card from 2022 to 2025. 24 GB of GDDR6X at ~1 TB/s makes 32B-class models at Q4 comfortably loadable, fully on-GPU.
The 24 GB VRAM is the headline — it's the smallest amount of memory that runs the modern 32B-class models full-GPU at Q4 (Qwen 3 32B, Qwen 2.5 Coder 32B, QwQ 32B, R1 Distill Qwen 32B all live in 19–22 GB), and the largest amount that's reasonable to buy on consumer pricing. Memory bandwidth at 1 TB/s keeps tokens/sec respectable on those models — 70+ tok/s is normal at Q4. CUDA support is universal: every runner has a happy path here.
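To see why 24 GB is the magic number, here's the back-of-envelope math. This is a sketch, not a spec sheet: the shape below assumes a Qwen 2.5 32B-like architecture (64 layers, 8 KV heads via GQA, head dim 128), and Q4_K_M's effective ~4.5 bits/weight is a rule of thumb.

```python
# Rough VRAM estimate for a 32B dense model at Q4 (assumed shape, see above).

def q4_weights_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    # Q4_K_M averages ~4.5 bits/weight once scales and zero points are counted.
    return params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: float = 2.0) -> float:
    # K and V, per layer, per token; fp16 cache by default.
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

weights = q4_weights_gb(32.8)                  # ≈ 18.5 GB
kv_fp16 = kv_cache_gb(64, 8, 128, 16_384)      # ≈ 4.3 GB at fp16
kv_q8 = kv_cache_gb(64, 8, 128, 16_384, 1.0)   # ≈ 2.1 GB with an 8-bit KV cache
print(f"{weights:.1f} GB weights + {kv_fp16:.1f} GB KV @ 16K, plus runtime overhead")
```

Weights plus a 16K cache land in the low 20s of GB: inside 24 GB, out of reach for 16 GB cards.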
Where it breaks
- 70B-class models partial-offload — Q4 70B weighs ~39 GB, so you offload onto system RAM and watch tok/s drop from 70+ to 22–28 (see the sketch after this list). Fine for serious work, painful for autocomplete.
- Llama 4 / DeepSeek V3 don't fit — true workstation models start at 80+ GB at Q4.
- Power draw at sustained load — 450 W TGP is real; consider PSU and thermals before slotting one in.
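For concreteness, here's what full versus partial offload looks like in llama-cpp-python. A sketch only: the GGUF file names are placeholders, the layer split is something you tune per machine, and you'd load one model at a time.

```python
from llama_cpp import Llama

# Full offload: a 32B Q4 GGUF fits in 24 GB, so put every layer on the GPU.
llm_32b = Llama(
    model_path="qwen2.5-coder-32b-instruct-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1,    # -1 = offload all layers
    n_ctx=16_384,
)

# Partial offload: a 70B Q4 GGUF (~39 GB) can't fit, so put what fits on the
# GPU and let the rest spill to system RAM, hence the 64+ GB RAM requirement.
llm_70b = Llama(
    model_path="llama-3.3-70b-instruct-q4_k_m.gguf",      # placeholder filename
    n_gpu_layers=45,    # tune upward until VRAM is nearly full (80 layers total)
    n_ctx=8_192,
)
```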
What to run
- Sweet spot: Qwen 3 32B / Qwen 2.5 Coder 32B / QwQ 32B / R1 Distill Qwen 32B — full GPU, 70+ tok/s, full 16K context.
- Stretch: Llama 3.3 70B / R1 Distill Llama 70B at Q4 with offload — 22–28 tok/s, requires 64+ GB system RAM.
- Comfortable: 14B-class with full 32K context, or 7–8B with 128K context — runs at 60+ tok/s with headroom.
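The long-context numbers are the same KV-cache arithmetic as above. A quick check for Llama 3.1 8B at 128K context, assuming its usual shape (32 layers, 8 KV heads, head dim 128):

```python
# KV bytes = 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes per element
kv_fp16 = 2 * 32 * 8 * 128 * 131_072 * 2
print(kv_fp16 / 1e9)   # ≈ 17.2 GB at fp16
# Q4 weights for the 8B are ~5 GB, so weights plus a full 128K cache sit
# around 22 GB: tight, which is why runners that quantize the KV cache
# to 8 bits (halving the 17.2 GB) make this configuration comfortable.
```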
Not the card for
- Genuine 100B+ MoE workloads (DeepSeek V3, Llama 4 Maverick) — workstation hardware required.
- Maximum tok/s on tiny models — for sub-7B at >200 tok/s, integrated solutions and lower-end cards are better $/throughput.
- Workstation reliability over years — the 4090 has consumer warranty terms; sustained 24/7 inference is technically out-of-spec.
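On the sustained-load point: if you do run long inference jobs, watch power and temperature rather than guessing. A minimal watcher using the NVML Python bindings (`pip install nvidia-ml-py`); the 450 W figure is the stock TGP noted above:

```python
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU

try:
    while True:
        watts = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000           # NVML reports mW
        temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
        print(f"{watts:6.1f} W / 450 W TGP   {temp:3d} °C")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```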
Buy this if you want the best single-card local-AI experience and either already own one bought at MSRP or are willing to pay the current resale premium. The model sweet spot (32B-class) is the highest-quality tier that runs full-GPU. Skip this if the RTX 5090 is in your budget (32 GB at higher bandwidth fits 70B-class models more comfortably), if you can wait for RTX 5090/5080 supply to normalize, or if 16 GB cards (RTX 4080, 5080) cover your model range — the 32B-class jump is what you're paying the premium for.
How it compares
- vs RTX 5090 → the 5090 is the proper successor (32 GB, ~1.8 TB/s bandwidth) and lets 70B-class models partial-offload less aggressively. Pick the 5090 if available at sane prices.
- vs RTX 3090 → the 3090 has the same 24 GB VRAM at slightly lower bandwidth (~936 GB/s); expect roughly 60% of the 4090's tok/s on the same model once compute-bound prompt processing is counted. Best value-for-VRAM card if you can find a clean used unit.
- vs RTX 4080 (16 GB) → 4080 is the wrong tier — 16 GB caps you at 14B-class full-GPU. The jump to 32B-class is what makes the 4090 worth the premium.
- vs RX 7900 XTX → 7900 XTX matches 24 GB at lower price, but ROCm is still a hassle and Vulkan paths are slower for most workloads. NVIDIA wins on software maturity.
- vs Apple M3 Max (64 GB+) → unified memory lets the M3 Max run 70B at Q4 without offload, at slower tokens/sec than 4090 partial-offload but with much less hassle. Different platform tradeoff.
Why this rating
9.4/10 — the consumer card every local-AI build benchmarks against. 24 GB VRAM at frontier-tier compute means you can full-GPU-offload Qwen 3 32B and Qwen 2.5 Coder 32B; partial-offload Llama 3.3 70B at usable speeds. Loses points only because the RTX 5090 exists at higher VRAM and the resale price is now stupid.
Specs
| Spec | Value |
|---|---|
| VRAM | 24 GB |
| Power draw | 450 W |
| Released | 2022 |
| MSRP | $1,599 |
| Backends | CUDA, Vulkan |
Benchmarks on this hardware
Real measurements on the NVIDIA GeForce RTX 4090. Numbers ship with the runner version, quant, and date so you can reproduce them; a minimal timing sketch follows the table.
| Model | Conf. | Quant | Ctx | Tokens / sec | VRAM | TTFT | Date |
|---|---|---|---|---|---|---|---|
| Mistral 7B Instruct v0.3 | M | Q4_K_M | 4K | 112.3 | 5.1 GB | 64 ms | Apr 22, 2026 |
| Llama 3.1 8B Instruct | M | Q4_K_M | 8K | 104.7 | 5.4 GB | 78 ms | Apr 22, 2026 |
| Mixtral 8x7B Instruct | M | Q4_K_M | 8K | 31.4 | 23.1 GB | 248 ms | Apr 23, 2026 |
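If you want to reproduce numbers in this spirit, here's a minimal timing sketch with llama-cpp-python. The file name and prompt are placeholders, and a real run should pin the runner version and quant exactly as the table does:

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.3-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1, n_ctx=4096, verbose=False,
)

start = time.perf_counter()
first = None
n_tokens = 0
# Stream tokens so we can separate time-to-first-token from decode speed.
for _ in llm("Explain KV caching in one paragraph.", max_tokens=256, stream=True):
    if first is None:
        first = time.perf_counter()   # time to first token
    n_tokens += 1
end = time.perf_counter()

print(f"TTFT: {(first - start) * 1000:.0f} ms")
print(f"decode: {n_tokens / (end - first):.1f} tok/s")
```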
Models that fit
Open-weight models small enough to run on the NVIDIA GeForce RTX 4090 with usable context.
Hardware worth comparing
Cards in the same VRAM tier, plus one step above and one below — so you can frame the buying decision against real options.
Frequently asked
What models can the NVIDIA GeForce RTX 4090 run?
32B-class models (Qwen 3 32B, Qwen 2.5 Coder 32B, QwQ 32B, R1 Distill Qwen 32B) run full-GPU at Q4 in 19–22 GB; 70B-class models run at Q4 with partial offload to system RAM at 22–28 tok/s.
Does the NVIDIA GeForce RTX 4090 support CUDA?
Yes. CUDA support is universal and every major runner has a happy path on this card; Vulkan is also available.
How much does the NVIDIA GeForce RTX 4090 cost?
MSRP was $1,599 at launch in 2022; resale prices now carry a substantial premium.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.