nvidia
GPU
24GB VRAM
enthusiast

NVIDIA GeForce RTX 4090

The community-default high-end local-AI card from 2022 to 2025. 24 GB of GDDR6X at ~1 TB/s runs 32B-class models fully on-GPU at Q4 and handles 70B Q4 with partial offload.

Released 2022 · ~$1899 street · 1008 GB/s memory bandwidth
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
9.4/10
What it does well

The 24 GB of VRAM is the headline — it's the smallest amount of memory that runs the modern 32B-class models full-GPU at Q4 (Qwen 3 32B, Qwen 2.5 Coder 32B, QwQ 32B, R1 Distill Qwen 32B all live in 19–22 GB), and the most VRAM you can reasonably buy at consumer pricing. Memory bandwidth at 1 TB/s keeps tokens/sec respectable on those models — 70+ tok/s is normal at Q4. CUDA support is universal: every runner has a happy path here.
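
To sanity-check the fit claims, here is a back-of-envelope sketch in Python; the ~4.5 bits/weight figure for Q4_K_M-class quants and the ~2 GB allowance for KV cache and buffers are assumptions, not measured values.

```python
# Rough GGUF footprint at 4-bit quantisation. Real files vary by a GB or two
# depending on the exact quant mix, so treat this as a first-pass filter only.
def q4_footprint_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    return params_billion * bits_per_weight / 8

VRAM_GB = 24
KV_AND_BUFFERS_GB = 2  # hand-wavy allowance for KV cache + runtime buffers

for name, params in [("Qwen 2.5 Coder 32B", 32.8), ("Llama 3.3 70B", 70.6)]:
    weights = q4_footprint_gb(params)
    fits = weights + KV_AND_BUFFERS_GB <= VRAM_GB
    verdict = "fits full-GPU" if fits else "needs partial offload"
    print(f"{name}: ~{weights:.0f} GB weights at Q4 -> {verdict}")
# ~18 GB -> fits full-GPU; ~40 GB -> needs partial offload, matching the review.
```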

Where it breaks
  • 70B-class models need partial offload — Q4 70B is ~39 GB, so layers spill onto system RAM and tok/s drops from 70+ to 22–28 (see the offload sketch after this list). Fine for serious work, painful for autocomplete.
  • Llama 4 / DeepSeek V3 don't fit — true workstation models start at 80+ GB at Q4.
  • Power draw at sustained load — 450 W TGP is real; consider PSU and thermals before slotting one in.
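
What partial offload looks like in practice: a minimal sketch using the llama-cpp-python bindings, assuming a local Q4_K_M GGUF of Llama 3.3 70B. The filename and layer count are illustrative; the idea is to raise n_gpu_layers until VRAM is nearly full and let the remainder run from system RAM.

```python
from llama_cpp import Llama

# Llama 3.3 70B has 80 transformer layers; roughly half fit in 24 GB at Q4.
llm = Llama(
    model_path="llama-3.3-70b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=40,   # layers beyond this stay on the CPU (system RAM)
    n_ctx=8192,
)

out = llm("Summarise the tradeoffs of partial GPU offload.", max_tokens=128)
print(out["choices"][0]["text"])
```
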
Ideal model range
  • Sweet spot: Qwen 3 32B / Qwen 2.5 Coder 32B / QwQ 32B / R1 Distill Qwen 32B — full GPU, 70+ tok/s, full 16K context.
  • Stretch: Llama 3.3 70B / R1 Distill Llama 70B at Q4 with offload — 22–28 tok/s, requires 64+ GB system RAM.
  • Comfortable: 14B-class with full 32K context, or 7–8B with 128K context — runs at 60+ tok/s with headroom (KV-cache math in the sketch below).
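
The context figures above are really KV-cache budgets, and KV-cache size grows linearly with context length. A quick sketch assuming Llama-3.1-8B-style geometry (32 layers, 8 KV heads, head dim 128) and an fp16 cache; other models will differ.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors: 2 * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

for ctx in (16_384, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(32, 8, 128, ctx):.1f} GB of KV cache")
# ~2.1 GB at 16K, ~4.3 GB at 32K, ~17.2 GB at 128K -- which is why 128K context
# only pairs with a 7-8B model on a 24 GB card.
```
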
Bad use cases
  • Genuine 100B+ MoE workloads (DeepSeek V3, Llama 4 Maverick) — workstation hardware required.
  • Maximum tok/s on tiny models — for sub-7B at >200 tok/s, integrated solutions and lower-end cards are better $/throughput.
  • Workstation reliability over years — the 4090 has consumer warranty terms; sustained 24/7 inference is technically out-of-spec.
Verdict

Buy this if you want the best single-card local-AI experience and can either find one near MSRP or stomach the current resale premium. The model sweet spot (32B-class) is the highest-quality tier that runs full-GPU. Skip this if the RTX 5090 is in your budget (32 GB at higher bandwidth means far less 70B offload), if you can wait for RTX 5090/5080 supply to normalize, or if 16 GB cards (RTX 4080, 5080) already cover your model range — the 32B-class jump is what you're paying the premium for.

How it compares
  • vs RTX 5090 → the 5090 is the proper successor (32 GB, ~1.8 TB/s bandwidth) and needs noticeably less offload for 70B-class models. Pick the 5090 if it's available at sane prices.
  • vs RTX 3090 → the 3090 has the same 24 GB of VRAM at lower bandwidth (~940 GB/s); expect roughly 60% of the 4090's tok/s on the same model. Best value-for-VRAM card if you can find a clean used unit.
  • vs RTX 4080 (16 GB) → 4080 is the wrong tier — 16 GB caps you at 14B-class full-GPU. The jump to 32B-class is what makes the 4090 worth the premium.
  • vs RX 7900 XTX → 7900 XTX matches 24 GB at lower price, but ROCm is still a hassle and Vulkan paths are slower for most workloads. NVIDIA wins on software maturity.
  • vs Apple M3 Max (64 GB+) → unified memory lets the M3 Max run 70B at Q4 without offload, at slower tokens/sec than 4090 partial-offload but with much less hassle. Different platform tradeoff.
Why this rating

9.4/10 — the consumer card every local-AI build benchmarks against. 24 GB of VRAM with frontier-tier compute means you can run Qwen 3 32B and Qwen 2.5 Coder 32B fully on-GPU and partial-offload Llama 3.3 70B at usable speeds. Loses points only because the RTX 5090 now exists with more VRAM and the resale price is now stupid.

Overview

The community-default high-end local-AI card from 2022 to 2025. 24 GB of GDDR6X at ~1 TB/s runs 32B-class models fully on-GPU at Q4 and handles 70B Q4 with partial offload.

Specs

VRAM: 24 GB
Power draw: 450 W
Released: 2022
MSRP: $1599
Backends: CUDA, Vulkan

Benchmarks on this hardware

Real measurements on NVIDIA GeForce RTX 4090. Numbers ship with the runner version, quant, and date so you can reproduce them.

3 runs on record
Model | Conf. | Quant | Ctx | Tokens/sec | VRAM | TTFT | Date
Mistral 7B Instruct v0.3 | M | Q4_K_M | 4K | 112.3 tok/s | 5.1 GB | 64 ms | Apr 22, 2026
Llama 3.1 8B Instruct | M | Q4_K_M | 8K | 104.7 tok/s | 5.4 GB | 78 ms | Apr 22, 2026
Mixtral 8x7B Instruct | M | Q4_K_M | 8K | 31.4 tok/s | 23.1 GB | 248 ms | Apr 23, 2026
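
To reproduce numbers like these yourself, llama.cpp's llama-bench is the rigorous route; a quick-and-dirty sketch with the llama-cpp-python bindings (model path and prompt are placeholders) looks like this:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-instruct-v0.3-q4_k_m.gguf",  # placeholder path
            n_gpu_layers=-1,   # -1 = offload every layer to the GPU
            n_ctx=4096, verbose=False)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for _chunk in llm("Write a haiku about VRAM.", max_tokens=256, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Generation: {n_tokens / (elapsed - ttft):.1f} tok/s")
```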

Models that fit

Open-weight models small enough to run on NVIDIA GeForce RTX 4090 with usable context.

Compare alternatives

Hardware worth comparing

Same VRAM tier and the one step above and below — so you can frame the buying decision against real options.

Same VRAM tier
Cards in the same memory band
Step up
More VRAM — bigger models, more context
No hardware with a published verdict in the next tier up yet.
Step down
Less VRAM — cheaper, more constrained

Frequently asked

What models can NVIDIA GeForce RTX 4090 run?

With 24GB VRAM, the NVIDIA GeForce RTX 4090 runs models up to ~32B in 4-bit, with room for context. See the model list below for tested combinations.

Does NVIDIA GeForce RTX 4090 support CUDA?

Yes — NVIDIA GeForce RTX 4090 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.
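
A quick sanity check that the card and CUDA runtime are visible before setting up any of those runners; this assumes a CUDA-enabled PyTorch build is installed.

```python
import torch

print(torch.cuda.is_available())           # True when driver + runtime are working
print(torch.cuda.get_device_name(0))       # e.g. "NVIDIA GeForce RTX 4090"
print(torch.cuda.get_device_properties(0).total_memory / 1e9)  # roughly 25.8 (decimal GB)
```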

How much does NVIDIA GeForce RTX 4090 cost?

Current street price for NVIDIA GeForce RTX 4090 is around $1899 (MSRP $1599). Prices vary by region and supply.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.