Quad RTX 3090 (24 GB × 4)
Four used 3090s in a homelab chassis. 96 GB total / ~88 GB effective. The cheapest path to 100B+ class models and high-concurrency 70B serving.
Four 3090s in a single chassis with PCIe + NVLink (paired bridges between cards 0-1 and 2-3) does not produce 96 GB of pooled VRAM. Tensor parallelism across 4 ranks with vLLM yields ~88 GB effective for model weights: the 96 GB total minus ~2 GB per card for activations, KV cache, and runtime overhead. That envelope covers 100B-class dense models at Q4, but not every large MoE: DeepSeek V2.5 (236B total / 21B active) needs ~134 GB at Q4 and does NOT fit. The 88 GB envelope is the realistic ceiling for prosumer multi-GPU before you pay for datacenter hardware.
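To make the fit arithmetic concrete, here is a back-of-envelope check in Python; the 2 GB/card overhead comes from the paragraph above, while the ~0.6 bytes/param figure for Q4-class quants is an assumption rather than a measured constant.

```python
# Back-of-envelope VRAM fit check. The per-card overhead comes from the
# paragraph above; ~0.6 bytes/param for Q4-class quants is an assumption,
# so treat the output as an estimate rather than a guarantee.
CARDS = 4
VRAM_PER_CARD_GB = 24
OVERHEAD_PER_CARD_GB = 2      # activations + KV cache + runtime overhead
Q4_BYTES_PER_PARAM = 0.6      # assumed effective size incl. quant metadata

def fits(total_params_billion: float) -> bool:
    budget_gb = CARDS * (VRAM_PER_CARD_GB - OVERHEAD_PER_CARD_GB)   # ~88 GB
    weights_gb = total_params_billion * Q4_BYTES_PER_PARAM
    print(f"{total_params_billion:.0f}B at Q4: ~{weights_gb:.0f} GB vs {budget_gb} GB budget")
    return weights_gb <= budget_gb

fits(70)    # ~42 GB: fits, with room for long context
fits(100)   # ~60 GB: fits
fits(236)   # ~142 GB: DeepSeek V2.5 does not fit
```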
Topology
- 4×rtx-3090
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.
Why this combo
Quad RTX 3090 is the prosumer ceiling for local AI. Beyond this you're either paying datacenter prices (H100 / A100 cluster) or distributing across multiple machines.
The use case sweet spot:
- 70B serving at high concurrency (16+ users)
- 100B-class dense or 50-100B MoE models
- Long-context (32K-128K) inference where KV cache budget matters
- Research / experimentation with model splits and serving topologies
What it's NOT for:
- Single-user latency (dual-3090 is faster per-stream for most models)
- Production at scale (datacenter hardware wins on reliability per dollar over a 3-year horizon)
- Anyone whose answer to "where will this live" is "my desk"
Runtime compatibility
- vLLM ✓ excellent. `--tensor-parallel-size 4` is the canonical configuration and the production default for the combo (a minimal launch sketch follows this list).
- SGLang ✓ excellent. RadixAttention compounds harder at 4 ranks because the prefix cache is shared across all 4 GPUs.
- ExLlamaV2 ✓ good. EXL2 quants are great, but rank-4 tensor parallel scales less cleanly than rank-2.
- Ray Serve ✓ excellent for serving multiple replicas — you can run 2× tensor-parallel-2 instead of 1× tensor-parallel-4 for higher per-stream throughput.
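For reference, a minimal launch sketch with vLLM's offline Python API, assuming a quantized 70B-class checkpoint; the model id below is a placeholder, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Minimal TP-4 sketch for a quad-3090 box using vLLM's offline Python API.
# The model id is a placeholder: pick an AWQ/GPTQ-quantized checkpoint whose
# weights fit the ~88 GB budget described above.
llm = LLM(
    model="your-org/llama-70b-awq",   # hypothetical quantized checkpoint
    tensor_parallel_size=4,           # one rank per 3090
    gpu_memory_utilization=0.90,      # leave headroom for transients
    max_model_len=32768,              # KV-cache budget scales with this
)

params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(["Explain NVLink pairing on a quad-3090 build."], params)
print(out[0].outputs[0].text)
```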
Split strategy
For 100B-class dense models, tensor parallelism at rank 4 is the path. Per-stream decode is ~20-25 tok/s; aggregate throughput across concurrent streams is high.
For MoE, expert routing across 4 cards lets the router put experts on different GPUs. Mixtral 8x7B / 8x22B / DeepSeek MoE all benefit.
For the best single-stream latency, 2× tensor-parallel-2 replicas via Ray Serve beat 1× tensor-parallel-4; counter-intuitive, but the cross-card communication overhead at rank 4 dominates per-token latency.
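A hedged sketch of that 2× tensor-parallel-2 layout: two OpenAI-compatible vLLM servers, each pinned to one NVLink pair via CUDA_VISIBLE_DEVICES, with a load balancer assumed in front. The model id and ports are placeholders.

```python
import os
import subprocess

# Hypothetical sketch of the 2x TP-2 layout: two OpenAI-compatible vLLM
# servers, each pinned to one NVLink pair. Model id and ports are
# placeholders; older vLLM versions use
# `python -m vllm.entrypoints.openai.api_server` instead of `vllm serve`.
MODEL = "your-org/llama-70b-awq"  # placeholder quantized checkpoint

def launch_replica(gpu_pair: str, port: int) -> subprocess.Popen:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_pair)
    return subprocess.Popen(
        ["vllm", "serve", MODEL,
         "--tensor-parallel-size", "2",
         "--port", str(port)],
        env=env,
    )

replicas = [launch_replica("0,1", 8001), launch_replica("2,3", 8002)]

# Put a load balancer (nginx, LiteLLM, or Ray Serve) in front of :8001/:8002,
# then keep the servers in the foreground.
for p in replicas:
    p.wait()
```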
Comparison vs alternatives
| Metric | Quad 3090 | Single H100 80GB | Mac Studio M3 Ultra 192GB |
|---|---|---|---|
| Effective VRAM | 88 GB | 80 GB | ~140 GB usable |
| Power | 1400W | 700W | ~370W |
| Total cost (used / new) | $3,000-4,500 | $20,000-30,000 | $7,000-10,000 |
| Largest model (Q4) | 100B dense | 80B dense | 200B+ dense |
| Setup difficulty | Advanced | Beginner | Beginner |
The Mac Studio M3 Ultra is genuinely competitive for "largest model that runs at all" but trails on tokens-per-second by 3-5×.
Related
- /hardware-combos/dual-rtx-3090 — entry point to multi-GPU
- /stacks/distributed-inference-homelab — when even quad-3090 isn't enough
- /guides/running-local-ai-on-multiple-gpus-2026 — buying guide
Best model classes
- 100B-class dense models at Q4 — anything that fits in 88 GB. The closest practical targets are smaller MoE flagships and 70B at higher quants.
- 70B at Q5_K_M / Q6 with extended context — quality lift over dual-3090 Q4 at the cost of throughput per card.
- High-concurrency 70B serving — 4 cards = 4 tensor-parallel ranks; vLLM serves 16+ concurrent requests at ~30 tok/s each (a quick load-test sketch follows this list).
- MoE 50-100B with expert routing — Mixtral 8x22B, DeepSeek Coder V2 236B (just barely doesn't fit at Q4 but Q3 might).
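The load-test sketch referenced above, under the assumption that a vLLM TP-4 server is already running locally on port 8000; the base URL and model id are illustrative, not values fixed by this page.

```python
import asyncio
import time

from openai import AsyncOpenAI  # vLLM exposes an OpenAI-compatible endpoint

# Assumed setup: a local vLLM server on :8000 started with TP-4.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "your-org/llama-70b-awq"   # whatever checkpoint the server loaded

async def one_request(i: int) -> int:
    resp = await client.completions.create(
        model=MODEL,
        prompt=f"[{i}] Explain tensor parallelism in two sentences.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(streams: int = 16) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(streams)))
    elapsed = time.perf_counter() - start
    total = sum(tokens)
    print(f"{streams} streams: {total} tokens in {elapsed:.1f}s "
          f"(~{total / elapsed / streams:.1f} tok/s per stream)")

asyncio.run(main())
```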
What this combo is bad at
- 236B+ MoE models — DeepSeek V3 / V4 Pro / Qwen 3.5 235B all exceed 88 GB at any practical quant. Forces CPU offload or cluster.
- Single-user latency — 4-rank tensor parallelism doesn't outperform 2-rank for single-stream decode; you only win on concurrency.
- Power-cost-sensitive deployment — 1400W under load × 24/7 = ~12,300 kWh/year = $1,500-2,500/year in electricity at 2026 US rates.
Who should avoid this
- Anyone without a basement or server room — thermal + acoustic envelope is unacceptable in living spaces.
- First-time multi-GPU builders — start with dual before quad.
- Anyone considering a single H100 80GB — at used-market $20k pricing, H100 makes sense if you can swing it; quad 3090 is the path when budget is hard-capped at $4k.
- Production users needing reliability — datacenter hardware exists for a reason; quad used consumer cards is a hobby-lab choice.
1400W of GPU draw under load demands a server chassis or open-frame mining rig; consumer tower cases cannot handle this thermally. Plan for a 1600W+ Platinum PSU (or dual PSUs synced via an Add2PSU adapter), industrial fans, and either basement / server-room placement or accepting that the build heats the room like a small space heater. Open-frame rigs are louder but thermally easier than enclosed chassis.
Four used GPUs = four points of failure with unknown service histories. Plan thermal-paste refresh + memory-junction monitoring on all 4 cards before committing to production. Power supply derating: a 1600W PSU under sustained 1400W load runs at 87% capacity — acceptable but at the upper bound. Better to use dual 1000W units.
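A monitoring sketch for that pre-production check, using NVML via the nvidia-ml-py bindings. NVML does not report GDDR6X memory-junction temperature on the 3090, so Tjunction still requires vendor or community tools; this only catches gross thermal and power problems.

```python
import time
import pynvml  # pip install nvidia-ml-py

# Hedged sketch: poll core temperature and board power on all four cards.
# NVML does not expose GDDR6X memory-junction temperature on the 3090, so
# Tjunction still needs vendor/community tools; this catches gross problems.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # mW -> W
            print(f"GPU{i}: {temp} C core, {watts:.0f} W")
        time.sleep(10)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```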
Ubuntu 22.04 LTS or 24.04 LTS — Linux is mandatory for this scale.
Failure modes specific to quad 3090
- Power-supply transient overload. Four 3090s have correlated transient spikes during model load — a 1600W PSU can trip OCP even at "normal" load when the spike compounds. Size for 1800W+ headroom or use dual PSUs.
- PCIe lane allocation collapse. Most consumer X670/B650 boards can't deliver x16/x16/x8/x8 to four cards. WRX80 / W790 workstation boards are required for full lane allocation. Without it, tensor-parallel performance suffers.
- NVLink pairing constraints. NVLink bridges only work between adjacent slot pairs — you typically get two NVLink pairs (cards 0-1 and 2-3), not a 4-card mesh. Tensor parallelism with rank 4 still works but the cross-pair links go over PCIe.
- Chassis thermal compounding. Even with open-frame mounting, each card dumping heat into the next one's intake is real. Stagger card placement and add inter-card fans.
- Driver-NUMA mismatches. On dual-socket builds, GPUs need to be on the same NUMA node as the CPU running the inference process. Mismatches silently cost 30%+ throughput (see the topology-check sketch after this list).
- Memory-junction temperature on the inner cards. Cards in slots 2-3 (sandwiched between others) routinely hit memory-junction temps over 110°C. Active inter-card cooling is mandatory.
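The topology-check sketch referenced in the list: it dumps the interconnect matrix and each GPU's NUMA node so missing NVLink pairs, starved PCIe lanes, and NUMA mismatches surface before benchmarking. It assumes Linux with nvidia-smi on the PATH and the nvidia-ml-py bindings installed.

```python
import subprocess
import pynvml  # pip install nvidia-ml-py

# Dump the interconnect matrix: shows which pairs talk over NVLink and which
# fall back to PCIe host bridges (lane starvation shows up here too).
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

# Map each GPU to its NUMA node via sysfs (-1 means no affinity reported,
# e.g. on single-socket boards). Paths assume a standard Linux sysfs layout.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    pci = pynvml.nvmlDeviceGetPciInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
    addr = f"{pci.domain:04x}:{pci.bus:02x}:{pci.device:02x}.0"
    with open(f"/sys/bus/pci/devices/{addr}/numa_node") as f:
        print(f"GPU{i} ({addr}): NUMA node {f.read().strip()}")
pynvml.nvmlShutdown()
```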
vLLM Tensor Parallel H100 Workstation →
Quad-3090 is the prosumer ceiling at ~$4k; 4× H100 is the production reference at $200k+. The H100 cluster wins on reliability per dollar over a 3-year horizon for orgs.
Quad RTX 3090 workstation →
Step-by-step setup with WRX80/W790 motherboard, NVLink pair verification, vLLM tensor-parallel-4 + power/thermal warnings.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4)
deepseek-r1-distill-llama-70b
Reasoning workload on quad-3090. R1 distill produces 5-15× more tokens per query; per-stream throughput drops vs a same-size non-reasoning model.
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.