Best GPU for AI agents
Honest 2026 GPU buyer guide for local AI agents: multi-model loops, tool-use, long context — why 24 GB is the floor and 48 GB unlocks parallel agents.
The short answer
AI agents are the most VRAM-hungry local-AI workload in 2026. A single-agent loop runs at least two models concurrently — a reasoning model plus an embedding model, and sometimes a third routing model. 24 GB VRAM is the practical minimum; 16 GB agent setups exist but severely constrain model choice.
For production multi-agent pipelines, 48 GB+ VRAM across one or more GPUs is the real target. A 70B reasoning model at Q4 + embedding model at FP16 + 32K context can consume 35-40 GB resident VRAM. Dual used RTX 3090s for ~$1,600 deliver 48 GB — the homelab agent sweet spot.
If you're building on a laptop, the M4 Max 64 GB MacBook Pro at $3,500 is the only laptop that runs a 70B agent + embedding model + context concurrently. x86 laptops cap at 16-24 GB of GPU VRAM and can't serve this tier of workload.
The picks, ranked by buyer leverage
RTX 4090 (24 GB)
24 GB · $1,400-1,900 used / $1,800-2,200 new
Fits 70B agent Q4 + embedding model + 8K context on a single card. Best solo-GPU agent experience in 2026.
Best for:
- Single-agent 70B Q4 + embedding model colocated
- Agent pipelines where one model dominates the VRAM budget
- Buyers wanting new silicon with warranty + Ada efficiency
Not for:
- Long-context agent loops (32K context needs 32 GB+)
- Parallel multi-agent serving (needs dual GPUs)
- Budget-constrained builders (a used 3090 is half the price)
RTX 5090 (32 GB)
32 GB · $2,000-2,500 (2026 retail)
32 GB runs 70B agent Q4 + embedding model at 32K context. Single-card multi-agent serving ceiling.
Best for:
- 70B agent + 32K context + embedding model on one card
- Parallel agent serving (2-3 small agents colocated)
- FP8-native agent inference with headroom
Not for:
- Multi-large-agent pipelines (still need dual GPUs)
- Cost-conscious builders (dual 3090s are cheaper for 48 GB)
- Agent setups that fit in 24 GB (a 4090 is enough)
Dual RTX 3090 (48 GB combined)
48 GB · ~$1,600 (two used 3090s, 2026)
48 GB combined via vLLM tensor-parallel or ExLlamaV2. The homelab multi-agent default — 70B agent + embedding + routing model.
Best for:
- Multi-agent pipelines (reasoning + embedding + routing)
- vLLM tensor-parallel 70B agent inference (quantized — FP16 70B needs ~140 GB)
- Homelab tinkerers comfortable with multi-GPU setup
Not for:
- Single-agent workflows (one 4090 is simpler)
- Space-constrained builds (dual 3090s = 4-slot cards)
- Windows users (multi-GPU LLM tooling is weaker than on Linux)
Apple M4 Max MacBook Pro (64 GB unified)
64 GB · $3,200-4,000 (M4 Max 64 GB MacBook Pro, 2026)
The only laptop that runs a full 70B agent + embedding model + context. 64 GB unified is a genuine agent workstation.
Best for:
- Mobile agent development (laptop-only workflow)
- 70B agent + embedding + long context on one device
- Developers who value silence + portability over throughput
Not for:
- CUDA-locked agent frameworks (vLLM, TensorRT)
- Production agent serving (Mac throughput is lower)
- Cost-conscious builders (a desktop with dual 3090s is cheaper)
Honesty: why benchmark numbers on this page might not reflect your real experience
- tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
- Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
- Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
- Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
- Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
- Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
- Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.
We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via our contact page. See also our methodology and editorial philosophy.
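The context-length effect above has a simple first-order explanation: every decoded token must stream the full model weights plus the KV cache accumulated so far, so tokens/sec is bounded by memory bandwidth. The sketch below uses assumed round numbers (a ~40 GB 4-bit 70B, ~0.33 GB of KV per 1K tokens, ~936 GB/s of 3090-class bandwidth); treat the outputs as ceilings, not predictions.

```python
# First-order decode-speed ceiling: tok/s <= bandwidth / bytes-streamed-per-token.
# Every generated token re-reads the model weights plus the KV cache so far.
# All constants below are rough assumptions, not measurements.

def decode_tok_s_ceiling(mem_bw_gb_s, weights_gb, kv_gb_per_1k, context_tokens):
    """Bandwidth-bound upper limit on decode speed (tokens/sec)."""
    streamed_gb = weights_gb + kv_gb_per_1k * context_tokens / 1000
    return mem_bw_gb_s / streamed_gb

BW = 936.0        # GB/s, RTX 3090-class memory bandwidth (assumed)
WEIGHTS = 40.0    # GB, ~70e9 params at ~4.5 bits/weight (assumed)
KV_PER_1K = 0.33  # GB of FP16 KV cache per 1K tokens, GQA 70B-class (assumed)

print(f"1K context:  ~{decode_tok_s_ceiling(BW, WEIGHTS, KV_PER_1K, 1024):.0f} tok/s")
print(f"32K context: ~{decode_tok_s_ceiling(BW, WEIGHTS, KV_PER_1K, 32_768):.0f} tok/s")
```

Real decode lands below this ceiling — attention compute, kernel overhead, and thermals all subtract from it — which is why measured 32K numbers (8-12 tok/s) sit under the ~18 tok/s bound this toy model gives.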
How to think about VRAM tiers
Agent VRAM budgets are additive. You're running at least two models simultaneously (reasoning + embedding), plus KV cache for context, plus overhead. Unlike simple chat, agents don't release VRAM between model calls.
- 16 GB — Single 13B agent Q4 + small embedding model. No room for context scaling. Severely constrained.
- 24 GB (agent minimum) — 70B agent Q4 + small embedding model + 8K context. Single-agent looped workflows viable.
- 32 GB — 70B agent Q4 + embedding model + 32K context. Multi-small-agent colocation. Comfortable single-card ceiling.
- 48 GB+ (multi-agent production) — Multi-agent parallel serving. vLLM tensor-parallel 70B (quantized; FP16 70B needs ~140 GB). Dual GPUs or Mac unified memory.
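The additive budget above can be turned into a back-of-envelope estimator. Everything here is a rule-of-thumb assumption (bits per weight, a ~0.6 GB embedding model, a flat 5% runtime overhead), not a measurement:

```python
def agent_vram_gb(params_b, weight_bits, embed_gb=0.6, kv_gb=0.0,
                  overhead_frac=0.05):
    """Rough resident-VRAM estimate (GB) for an agent stack.

    Agents keep every model loaded at once, so the budget is additive:
    reasoning weights + embedding model + KV cache, plus a flat
    overhead factor for activations and runtime buffers.
    All constants are assumptions, not measurements.
    """
    weights_gb = params_b * weight_bits / 8   # 1e9 params at 8 bits ~ 1 GB
    return (weights_gb + embed_gb + kv_gb) * (1 + overhead_frac)

# 13B agent at ~4 bits + embedder, short context:
print(round(agent_vram_gb(13, 4.0, kv_gb=0.7), 1))   # -> 8.2
# 70B agent at ~4 bits + embedder + 32K context with a Q4 KV cache (~2.7 GB):
print(round(agent_vram_gb(70, 4.0, kv_gb=2.7), 1))   # -> 40.2
```

The 70B figure lands at the top of the 35-40 GB range quoted earlier; swap in an FP16 KV cache at 32K (~10.8 GB) and the same stack climbs toward 50 GB, which is why the 48 GB tier exists.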
Frequently asked questions
Why do AI agents need so much VRAM?
Agents aren't single-model chat. A typical agent loop runs (1) a reasoning model for planning, (2) an embedding model for retrieval, and sometimes (3) a routing model for tool selection. All three must be VRAM-resident simultaneously. Plus the KV cache grows with multi-turn context. A simple 70B agent loop at Q4 with RAG can consume 35-40 GB.
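The KV-cache growth in that answer can be computed directly from model geometry. The dimensions below follow a Llama-3-70B-style layout (80 layers, 8 KV heads via grouped-query attention, head dim 128) — an assumed example; adjust for your actual model:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens,
                bytes_per_elem=2):
    """KV cache size: a K and a V tensor per layer, per KV head, per token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_tokens / 1e9

# 70B-class GQA model at 32K context:
print(round(kv_cache_gb(80, 8, 128, 32_768), 1))                    # FP16 -> 10.7
print(round(kv_cache_gb(80, 8, 128, 32_768, bytes_per_elem=1), 1))  # 8-bit -> 5.4
```

That ~10.7 GB of FP16 KV at 32K is most of the gap between the 8K and 32K VRAM tiers; quantized KV caches (ExLlamaV2's Q4 cache, llama.cpp's cache-type options) claw much of it back.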
Can I run agents on a 16 GB GPU?
Technically yes — with severe constraints. You can run a 13B agent Q4 + small embedding model at minimal context. But agent loops are slower (constant model swapping), context is short, and you can't run 70B-class reasoning models. 16 GB is an agent-learning tier, not an agent-doing tier.
What's the best GPU setup for vLLM agent serving?
vLLM shines on multi-GPU setups. Dual 3090s with 48 GB combined serve a quantized 70B (AWQ/GPTQ) via tensor parallelism for agent inference — the budget production path; FP16 70B needs ~140 GB and doesn't fit. A single 4090/5090 works for Q4 quantized serving. If you're running vLLM + a RAG pipeline + an embedding server, plan VRAM around your concurrency ceiling.
Does ExLlamaV2 help for agent workloads?
Yes — ExLlamaV2's Q4 cache makes prompt processing 2-4x faster than llama.cpp. For agent loops with long prompts (tool-use instructions + context + example outputs), the prompt eval speed-up is transformative. Multi-GPU with ExLlamaV2 tensor-parallel is the best homelab agent strategy.
Can I use a Mac for AI agent development?
Yes, if you spec 64+ GB unified. M4 Max 64 GB runs a 70B agent + embedding model at Q4. The Metal backend (MLX, llama.cpp) supports most agent frameworks. The limitation is speed — Mac inference is 30-50% slower than comparable NVIDIA for prompt processing, which matters for agent loops with long tool-use prompts.
Do I need multiple GPUs or one big one?
For most solo agents: one big card (4090/5090) is simpler and more reliable. For production multi-agent serving: dual GPUs via vLLM tensor-parallel or ExLlamaV2 are better leverage. Two 3090s at 48 GB deliver more agent VRAM than one 5090 at 32 GB. The PCIe bottleneck on multi-GPU agent serving is real but manageable at PCIe 4.0 x8.
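The "manageable PCIe bottleneck" claim can be sanity-checked with rough arithmetic: during decode, 2-way tensor parallelism performs roughly two all-reduces per transformer layer per token, each moving one hidden-state vector. The geometry below (80 layers, hidden size 8192, FP16) is a 70B-class assumption:

```python
def tp_decode_traffic_mb_s(n_layers, hidden, tok_s,
                           bytes_per_elem=2, allreduces_per_layer=2):
    """Approximate interconnect traffic for 2-way tensor-parallel decode.

    Rough model: ignores prefill bursts and per-transfer latency
    (which usually hurts more than raw bandwidth at batch size 1).
    """
    bytes_per_token = allreduces_per_layer * n_layers * hidden * bytes_per_elem
    return bytes_per_token * tok_s / 1e6

# 70B-class geometry at 20 tok/s:
print(round(tp_decode_traffic_mb_s(80, 8192, 20), 1))  # -> 52.4 (MB/s)
```

~52 MB/s against PCIe 4.0 x8's roughly 16 GB/s peak is why decode bandwidth isn't the problem; the synchronization latency of those 160 small transfers per token is where multi-GPU overhead actually shows up.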
Go deeper
- Best GPU for local AI (pillar) — All workloads ranked across VRAM tiers
- Local AI for coding agents — Code-agent hardware + toolchain picks
- Best GPU for RAG — RAG hardware picks (common agent sub-workload)
- Running local AI on multiple GPUs — Multi-GPU tensor-parallel agent serving strategies
- Best used GPU for local AI — Used buying guide (dual 3090 is the agent-homelab default)
When it doesn't work
Hardware bought, set up correctly, still failing? Start with the highest-volume local-AI errors and their fixes.
Common alternatives readers consider:
- If your budget is tighter → best budget GPU for local AI
- If you'd rather buy used → best used GPU for local AI
- If you're on Apple Silicon → best Mac for local AI
- If you're not sure what fits your build → the will-it-run checker
- If you don't want to buy anything yet → our editorial philosophy