Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling
4 used RTX 3090s in an open-frame chassis. 96 GB total / ~88 GB effective. vLLM tensor-parallel-4 for 100B-class dense or 50-100B MoE models. The prosumer ceiling before paying datacenter prices.
What this stack accomplishes
Quad RTX 3090 is the prosumer ceiling for local AI. Beyond this, you're paying datacenter prices (H100 / A100 cluster) or distributing across multiple machines. The use case sweet spot:
- 100B-class dense models that don't fit in 48 GB
- Large MoE models (Mixtral 8x22B, DeepSeek MoE)
- 70B serving at high concurrency (16+ users)
- Long-context (32K-128K) inference where KV cache budget matters
What it's NOT for: single-user latency (dual-3090 is faster per-stream), production at scale (datacenter wins on reliability per dollar), or anyone whose answer to “where will this live?” is “my desk.”
Hardware required
4× RTX 3090 24GB (used) · NVLink 3-slot bridges (paired: cards 0-1 and 2-3) · WRX80 / W790 workstation motherboard with 4× PCIe 4.0 x16 slots · 1600W Platinum PSU OR dual 1000W PSUs synced via Add2PSU · 128GB DDR5 ECC · 2TB NVMe · Open-frame mining rig OR 4U server chassis · Inter-card cooling fans · Ubuntu 24.04 LTS
Components — what to install and why
- 01 · Hardware · rtx-3090: GPUs (4× 24GB used; the prosumer-ceiling stack)
Quad 3090 is the largest practical multi-GPU setup before datacenter pricing kicks in. Used 3090 economics: $600-900 each → $2,400-3,600 total for the GPUs vs $20,000+ for a single H100 80GB used.
- 02 · Tool · vllm: inference engine (tensor-parallel-4)
vLLM --tensor-parallel-size 4 is the canonical configuration for quad-GPU. NVLink between paired cards (0-1, 2-3); cross-pair traffic goes over PCIe. For the lowest single-stream latency, run 2× tensor-parallel-2 replicas instead; counter-intuitive, but the cross-pair PCIe overhead at TP-4 dominates.
- 03 · Tool · ray-serve: orchestration layer for the 2×TP-2 replica pattern
Ray Serve over quad-3090: split into 2 replicas of tensor-parallel-2 each. Higher single-stream throughput than 1×TP-4; better aggregate concurrency. Use when serving 8+ users.
- 04 · Model · mixtral-8x22b-instruct: large MoE model (39B-active, 141B total)
Mixtral 8x22B at AWQ-INT4 fits across 88 GB effective with comfortable headroom. Expert routing across 4 cards is bandwidth-friendlier than dense tensor-parallel; the penalty for the NVLink-less cross-pair links shrinks for MoE.
- 05 · Model · deepseek-r1-distill-llama-70b: 70B reasoning model with extreme context
70B at AWQ-INT4 (~40 GB) fits with massive headroom on quad-3090, freeing ~48 GB for KV cache — supports 64K+ context for long-document reasoning workloads.
- 06 · Model · qwen-2.5-coder-32b-instruct: coding-agent serving for 16+ concurrent users
32B-class coding model on quad-3090 leaves enormous headroom — vLLM continuous batching serves 16+ concurrent coding-agent loops at ~30 tok/s each. The team-tier coding-serve config.
- 07 · Tool · openwebui: multi-user team chat frontend
Open WebUI handles authentication + multi-user chat against the vLLM endpoint. Standard pairing for vLLM-backed serving at team scale.
Step-by-step setup
Assumes Ubuntu 24.04 LTS on a WRX80 or W790 workstation motherboard.
1. NVIDIA driver + CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560
sudo reboot
# Verify all 4 cards detected
nvidia-smi --query-gpu=index,name,memory.total,pci.bus_id --format=csv
# Expected: 4 lines, RTX 3090 24GB each, distinct bus IDs
# Verify PCIe lane allocation (workstation board should be x16/x16/x16/x16)
sudo lspci -vv -d 10de: | grep -iE "vga compatible|lnksta:"
2. Verify NVLink pair status
nvidia-smi nvlink --status
# Expected output for paired NVLink (cards 0-1, cards 2-3):
# GPU 0: Active (linked to GPU 1)
# GPU 1: Active (linked to GPU 0)
# GPU 2: Active (linked to GPU 3)
# GPU 3: Active (linked to GPU 2)
# Cross-pair (0↔2, 0↔3, 1↔2, 1↔3) goes over PCIe.
# Run pair-bandwidth check
nvbandwidth -t device_to_device_bidirectional_memcpy_ce
# Expected: paired cards report ~100+ GB/s; cross-pair ~32 GB/s (PCIe)
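For a single consolidated view of the same links, the driver's topology matrix works as well; exact column labels vary slightly by driver version.
# One-shot link matrix: bridged pairs show an NV# entry, cross-pair paths show
# PHB/NODE/SYS (i.e., PCIe through the host)
nvidia-smi topo -m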
3. Install vLLM and serve Mixtral 8x22B
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm autoawq
# Pull model
huggingface-cli download casperhansen/mixtral-8x22b-instruct-v0.1-awq
# Serve with tensor-parallel-4
vllm serve casperhansen/mixtral-8x22b-instruct-v0.1-awq \
--tensor-parallel-size 4 \
--max-model-len 16384 \
--quantization awq \
--gpu-memory-utilization 0.90 \
--port 8000
Lower --gpu-memory-utilization to 0.90 (vs 0.92 for dual-card): quad-card tensor-parallel has more cross-card communication overhead and benefits from the larger headroom for sync buffers.
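Once the server is up, a quick smoke test against the OpenAI-compatible endpoint that vLLM exposes confirms all four cards are participating; the prompt and token count below are arbitrary.
# Wait for the server to report it is listening on port 8000, then:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/mixtral-8x22b-instruct-v0.1-awq",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
# While it generates, all 4 GPUs should show activity:
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv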
4. (Optional) Run 2× tensor-parallel-2 replicas via Ray Serve
# For lower single-stream latency and better aggregate concurrency, split into 2 TP-2 replicas:
ray start --head
serve start
serve run vllm_serve:app --bind 0.0.0.0:8000 \
--num-replicas 2 --tensor-parallel-size 2
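If you would rather skip Ray Serve, the same 2×TP-2 split can be sketched as two plain vLLM instances pinned to the NVLink pairs with CUDA_VISIBLE_DEVICES. This assumes the pairs are GPUs 0-1 and 2-3 (as verified in step 2) and uses Qwen/Qwen2.5-Coder-32B-Instruct-AWQ as a stand-in model that fits within 48 GB per replica; ports and the load balancer in front are your choice.
# Replica A on the first NVLink pair (GPUs 0-1)
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --tensor-parallel-size 2 --quantization awq \
  --gpu-memory-utilization 0.92 --max-model-len 16384 --port 8001 &
# Replica B on the second NVLink pair (GPUs 2-3)
CUDA_VISIBLE_DEVICES=2,3 vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --tensor-parallel-size 2 --quantization awq \
  --gpu-memory-utilization 0.92 --max-model-len 16384 --port 8002 &
# Round-robin :8001 and :8002 behind nginx, or register both endpoints in Open WebUI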
Verifying NVLink + PCIe topology
Critical: most consumer chipsets cannot deliver x16 to all 4 slots. Verify with:
# Check link width on each GPU
for gpu in 0 1 2 3; do
echo "GPU $gpu:"
sudo nvidia-smi -i $gpu --query-gpu=pci.link.width.current --format=csv,noheader
done
# Expected on workstation board: 16, 16, 16, 16 (all x16)
# On consumer board: typically 8, 8, 8, 8, which costs ~15% throughput
Expected outcome
vLLM tensor-parallel-4 serving Mixtral 8x22B AWQ-INT4 at 25-35 tok/s decode (single stream); 100+ tok/s aggregate at 4 concurrent. Sustained power: 1300-1450W. Memory-junction temps stay below 105°C with active inter-card cooling. Operates 24/7 in a basement or server room only.
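To verify those numbers on your own build, poll the driver's query interface during a sustained benchmark run; temperature.memory may report N/A on some driver and board combinations.
# Log power draw and temps every 5 seconds while inference is in flight
nvidia-smi --query-gpu=index,power.draw,temperature.gpu,temperature.memory \
  --format=csv -l 5 | tee thermals.csv
# Whole-system wall draw is roughly the sum of the four power.draw values plus
# 150-250W for CPU, RAM, fans, and PSU losses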
Power, thermal, and acoustic notes (read this first)
- 1600W Platinum PSU minimum for sustained 1400W draw. Transient spikes can hit 1700-1800W; size for headroom or use dual 1000W PSUs synced via Add2PSU.
- Open-frame chassis or 4U server chassis required. Standard ATX towers cannot dissipate 1400W of GPU heat — thermal throttling on inner cards is the typical result.
- Inter-card cooling fans mandatory. Cards in slots 2-3 are sandwiched and reach memory-junction temps over 110°C without active cooling. Industrial 120mm fans between cards.
- Acoustic envelope: 60-70 dBA at 1m. Loud. Plan basement / server-room placement; this is not desk-friendly.
- Power cost: 1400W × 24h × 365 days = 12,264 kWh/year. At $0.13/kWh that's ~$1,600/year just for the GPUs at sustained 100% utilization.
Failure modes you'll hit
- PCIe lane allocation collapse on consumer boards. Most X670/B650 boards drop to x8/x8/x8/x4 with all 4 slots populated. Workstation boards (WRX80/W790) are required for x16/x16/x16/x16. Without it, tensor-parallel performance drops 15-25%.
- Power-supply transient overload. Four 3090s spike to 1700-1800W transient. A 1600W PSU sized exactly trips OCP. Use dual PSUs or 1800W+.
- NVLink-incompatible runtime configurations. vLLM auto-detects NVLink between paired cards. Some llama.cpp builds require explicit --tensor-split arguments to use it; getting the syntax wrong silently disables NVLink.
- NUMA placement on dual-socket builds. GPUs need to be on the same NUMA node as the inference process. Mismatches cost 30%+ throughput silently. Use numactl --cpunodebind=0 --membind=0 vllm serve ... (see the snippet after this list).
- Memory-junction thermal throttling on inner cards. Slot-2 and slot-3 cards heat faster (sandwiched). nvidia-smi --query-gpu=temperature.memory --format=csv reveals junction temps invisible in the standard output.
- Used-card thermal-paste degradation. 5-year-old 3090s benefit from a thermal-paste refresh. Plan for a re-paste on all 4 before committing to production workloads.
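For the NUMA bullet above, check placement before pinning. numactl --hardware and nvidia-smi topo -m are the standard tools; the serve flags below simply repeat step 3's configuration with node 0 assumed to own the GPUs.
# Pinning matters mainly on dual-socket builds; single-node boards show one NUMA node
numactl --hardware
# The NUMA Affinity column of the topology matrix shows which node owns each GPU
nvidia-smi topo -m
# Pin the inference process to that node
numactl --cpunodebind=0 --membind=0 \
  vllm serve casperhansen/mixtral-8x22b-instruct-v0.1-awq \
    --tensor-parallel-size 4 --quantization awq \
    --gpu-memory-utilization 0.90 --port 8000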
Troubleshooting
Symptom: tensor-parallel-4 is slower than expected. Check the actual PCIe link widths via lspci. If x8 instead of x16, your motherboard is bottlenecking. Workstation boards (WRX80/W790) deliver true x16 to all 4 slots.
Symptom: 2 cards visible, 4 plugged in. Power supply is undersized — the BIOS may be dropping cards that draw too much current at boot. Verify each card individually with a single-card config first.
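For that per-card check, a device-properties probe per GPU is enough to spot a card the BIOS or driver has dropped; run it inside the vLLM venv from step 3, which provides PyTorch.
# Probe each card in isolation; a dropped or power-starved card errors out here
source ~/venvs/vllm/bin/activate
for gpu in 0 1 2 3; do
  echo "GPU $gpu:"
  CUDA_VISIBLE_DEVICES=$gpu python3 -c \
    "import torch; p = torch.cuda.get_device_properties(0); print(p.name, p.total_memory // 2**30, 'GiB')"
done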
Symptom: random crashes after hours of inference. Memory-junction thermal throttling has crossed the failure threshold. Add inter-card fans; consider water-cooling the inner cards.
Variations and alternatives
Frontier-MoE variant: this stack does not fit Qwen 3.5 235B-A17B or DeepSeek V4 Pro at any practical quant. Move to Mac Studio M3 Ultra 192GB or 4× H100 cluster.
Reasoning-first variant: serve DeepSeek R1 Distill Llama 70B with 64K+ context; quad-3090 leaves substantial KV-cache headroom for long reasoning chains (a serve sketch follows this list).
Single-machine alternative: Mac Studio M3 Ultra 192GB fits 192 GB at 370W vs quad-3090's 1400W. Slower per-stream but simpler operationally.
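A minimal sketch of the reasoning-first variant, reusing the TP-4 layout from step 3. The AWQ repository name below is a placeholder for whichever INT4 build of DeepSeek-R1-Distill-Llama-70B you trust, and 65536 tokens is one reasonable long-context setting rather than a hard limit.
# ~40 GB of 70B AWQ weights across 4 cards leaves ~48 GB for KV cache
vllm serve <your-chosen-repo>/DeepSeek-R1-Distill-Llama-70B-AWQ \
  --tensor-parallel-size 4 \
  --quantization awq \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --port 8000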
Who should avoid this build
- Living-room or office deployments — thermal + acoustic envelope is unacceptable.
- First-time multi-GPU builders — start with dual before quad. Operational complexity is real.
- Anyone considering a single H100 80GB — at $20-30k used, H100 makes sense if budget allows; quad-3090 is the path when budget is hard-capped at $4-5k.
- Production users needing reliability — datacenter hardware exists for a reason. Quad used consumer cards is a hobby-lab choice.
Going deeper
- Quad RTX 3090 combo detail — operator-grade review with effective-VRAM math + failure modes.
- Will-it-run for this combo — see exactly which models fit, are borderline, or aren't practical.
- Multi-GPU buying guide — full decision framework.
- Dual RTX 3090 stack — entry-tier multi-GPU step.
- Distributed inference systems — architectural depth.