nvidia
GPU
24GB VRAM
enthusiast

NVIDIA GeForce RTX 4090

The community-default high-end local-AI card from 2022 to 2025. 24 GB of GDDR6X at ~1 TB/s runs 32B-class models fully on-GPU at Q4 and handles 70B Q4 with partial offload.

Released 2022 · ~$1899 street · 1008 GB/s memory bandwidth
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
9.4/10
What it does well

The 24 GB of VRAM is the headline — it's the smallest amount of memory that runs the modern 32B-class models full-GPU at Q4 (Qwen 3 32B, Qwen 2.5 Coder 32B, QwQ 32B, R1 Distill Qwen 32B all live in 19–22 GB), and the most VRAM you can reasonably buy at consumer pricing. Memory bandwidth at 1 TB/s keeps tokens/sec respectable on those models — 70+ tok/s is normal at Q4. CUDA support is universal: every runner has a happy path here.
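
To sanity-check the fit claims, here is a back-of-envelope sketch in Python; the ~4.5 bits/weight figure for Q4_K_M-class quants and the ~2 GB allowance for KV cache and buffers are assumptions, not measured values.

```python
# Rough GGUF footprint at 4-bit quantisation. Real files vary by a GB or two
# depending on the exact quant mix, so treat this as a first-pass filter only.
def q4_footprint_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    return params_billion * bits_per_weight / 8

VRAM_GB = 24
KV_AND_BUFFERS_GB = 2  # hand-wavy allowance for KV cache + runtime buffers

for name, params in [("Qwen 2.5 Coder 32B", 32.8), ("Llama 3.3 70B", 70.6)]:
    weights = q4_footprint_gb(params)
    fits = weights + KV_AND_BUFFERS_GB <= VRAM_GB
    verdict = "fits full-GPU" if fits else "needs partial offload"
    print(f"{name}: ~{weights:.0f} GB weights at Q4 -> {verdict}")
# ~18 GB -> fits full-GPU; ~40 GB -> needs partial offload, matching the review.
```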

Where it breaks
  • 70B-class models need partial offload — Q4 70B is ~39 GB, so layers spill onto system RAM and tok/s drops from 70+ to 22–28 (see the offload sketch after this list). Fine for serious work, painful for autocomplete.
  • Llama 4 / DeepSeek V3 don't fit — true workstation models start at 80+ GB at Q4.
  • Power draw at sustained load — 450 W TGP is real; consider PSU and thermals before slotting one in.
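
What partial offload looks like in practice: a minimal sketch using the llama-cpp-python bindings, assuming a local Q4_K_M GGUF of Llama 3.3 70B. The filename and layer count are illustrative; the idea is to raise n_gpu_layers until VRAM is nearly full and let the remainder run from system RAM.

```python
from llama_cpp import Llama

# Llama 3.3 70B has 80 transformer layers; roughly half fit in 24 GB at Q4.
llm = Llama(
    model_path="llama-3.3-70b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=40,   # layers beyond this stay on the CPU (system RAM)
    n_ctx=8192,
)

out = llm("Summarise the tradeoffs of partial GPU offload.", max_tokens=128)
print(out["choices"][0]["text"])
```
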
Ideal model range
  • Sweet spot: Qwen 3 32B / Qwen 2.5 Coder 32B / QwQ 32B / R1 Distill Qwen 32B — full GPU, 70+ tok/s, full 16K context.
  • Stretch: Llama 3.3 70B / R1 Distill Llama 70B at Q4 with offload — 22–28 tok/s, requires 64+ GB system RAM.
  • Comfortable: 14B-class with full 32K context, or 7–8B with 128K context — runs at 60+ tok/s with headroom (KV-cache math in the sketch below).
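
The context figures above are really KV-cache budgets, and KV-cache size grows linearly with context length. A quick sketch assuming Llama-3.1-8B-style geometry (32 layers, 8 KV heads, head dim 128) and an fp16 cache; other models will differ.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors: 2 * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

for ctx in (16_384, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(32, 8, 128, ctx):.1f} GB of KV cache")
# ~2.1 GB at 16K, ~4.3 GB at 32K, ~17.2 GB at 128K -- which is why 128K context
# only pairs with a 7-8B model on a 24 GB card.
```
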
Bad use cases
  • Genuine 100B+ MoE workloads (DeepSeek V3, Llama 4 Maverick) — workstation hardware required.
  • Maximum tok/s on tiny models — for sub-7B at >200 tok/s, integrated solutions and lower-end cards are better $/throughput.
  • Workstation reliability over years — the 4090 has consumer warranty terms; sustained 24/7 inference is technically out-of-spec.
Verdict

Buy this if you want the best single-card local-AI experience and can either find one near MSRP or stomach the current resale premium. The model sweet spot (32B-class) is the highest-quality tier that runs full-GPU. Skip this if the RTX 5090 is in your budget (32 GB at higher bandwidth means far less 70B offload), if you can wait for RTX 5090/5080 supply to normalize, or if 16 GB cards (RTX 4080, 5080) already cover your model range — the 32B-class jump is what you're paying the premium for.

How it compares
  • vs RTX 5090 → the 5090 is the proper successor (32 GB, ~1.8 TB/s bandwidth) and needs noticeably less offload for 70B-class models. Pick the 5090 if it's available at sane prices.
  • vs RTX 3090 → the 3090 has the same 24 GB of VRAM at lower bandwidth (~940 GB/s); expect roughly 60% of the 4090's tok/s on the same model. Best value-for-VRAM card if you can find a clean used unit.
  • vs RTX 4080 (16 GB) → 4080 is the wrong tier — 16 GB caps you at 14B-class full-GPU. The jump to 32B-class is what makes the 4090 worth the premium.
  • vs RX 7900 XTX → 7900 XTX matches 24 GB at lower price, but ROCm is still a hassle and Vulkan paths are slower for most workloads. NVIDIA wins on software maturity.
  • vs Apple M3 Max (64 GB+) → unified memory lets the M3 Max run 70B at Q4 without offload, at slower tokens/sec than 4090 partial-offload but with much less hassle. Different platform tradeoff.
Why this rating

9.4/10 — the consumer card every local-AI build benchmarks against. 24 GB of VRAM with frontier-tier compute means you can run Qwen 3 32B and Qwen 2.5 Coder 32B fully on-GPU and partial-offload Llama 3.3 70B at usable speeds. Loses points only because the RTX 5090 now exists with more VRAM and the resale price is now stupid.

Overview

The community-default high-end local-AI card from 2022 to 2025. 24 GB of GDDR6X at ~1 TB/s runs 32B-class models fully on-GPU at Q4 and handles 70B Q4 with partial offload.

Specs

VRAM: 24 GB
Power draw: 450 W
Released: 2022
MSRP: $1599
Backends: CUDA, Vulkan

Benchmarks on this hardware

Real measurements on NVIDIA GeForce RTX 4090. Numbers ship with the runner version, quant, and date so you can reproduce them.

3 runs on record
Model | Conf. | Quant | Ctx | Tokens/sec | VRAM | TTFT | Date
Mistral 7B Instruct v0.3 | M | Q4_K_M | 4K | 112.3 tok/s | 5.1 GB | 64 ms | Apr 22, 2026
Llama 3.1 8B Instruct | M | Q4_K_M | 8K | 104.7 tok/s | 5.4 GB | 78 ms | Apr 22, 2026
Mixtral 8x7B Instruct | M | Q4_K_M | 8K | 31.4 tok/s | 23.1 GB | 248 ms | Apr 23, 2026
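
To reproduce numbers like these yourself, llama.cpp's llama-bench is the rigorous route; a quick-and-dirty sketch with the llama-cpp-python bindings (model path and prompt are placeholders) looks like this:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-instruct-v0.3-q4_k_m.gguf",  # placeholder path
            n_gpu_layers=-1,   # -1 = offload every layer to the GPU
            n_ctx=4096, verbose=False)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for _chunk in llm("Write a haiku about VRAM.", max_tokens=256, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Generation: {n_tokens / (elapsed - ttft):.1f} tok/s")
```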

Models that fit

Open-weight models small enough to run on NVIDIA GeForce RTX 4090 with usable context.

Compare alternatives

Hardware worth comparing

Same VRAM tier and the one step above and below — so you can frame the buying decision against real options.

Same VRAM tier
Cards in the same memory band
Step up
More VRAM — bigger models, more context
No hardware with a published verdict in the next tier up yet.
Step down
Less VRAM — cheaper, more constrained

Frequently asked

What models can NVIDIA GeForce RTX 4090 run?

With 24GB VRAM, the NVIDIA GeForce RTX 4090 runs models up to ~32B in 4-bit, with room for context. See the model list below for tested combinations.

Does NVIDIA GeForce RTX 4090 support CUDA?

Yes — NVIDIA GeForce RTX 4090 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.
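
A quick sanity check that the card and CUDA runtime are visible before setting up any of those runners; this assumes a CUDA-enabled PyTorch build is installed.

```python
import torch

print(torch.cuda.is_available())           # True when driver + runtime are working
print(torch.cuda.get_device_name(0))       # e.g. "NVIDIA GeForce RTX 4090"
print(torch.cuda.get_device_properties(0).total_memory / 1e9)  # roughly 25.8 (decimal GB)
```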

How much does NVIDIA GeForce RTX 4090 cost?

Current street price for NVIDIA GeForce RTX 4090 is around $1899 (MSRP $1599). Prices vary by region and supply.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.