Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink
Two RTX 4090s + vLLM tensor-parallel-2 over PCIe (no NVLink). Effective ~45 GB VRAM, ~28-35 tok/s decode on 70B. The newer-architecture alternative to dual-3090 — 2× the cost, ~30% faster, FP8 support, no NVLink.
What this stack accomplishes
Two RTX 4090s in a single chassis run 70B-class local AI with newer-architecture compute, FP8 quantization support, and new-card warranty. The catch every tutorial glosses over: NVIDIA removed NVLink from the 4090 generation. Two 4090s communicate only over PCIe at ~32 GB/s aggregate bandwidth — vs ~112 GB/s NVLink on dual-3090.
The honest comparison:
- Dual 4090 PCIe: ~28-35 tok/s on 70B Q4. New cards. FP8. ~$4,500-5,500 total system cost.
- Dual 3090 NVLink: ~25-30 tok/s on 70B Q4. Used cards. ~$2,000-2,800 total system cost.
- Single 5090: ~50-65 tok/s on 32B Q4. Largest model envelope: 32B Q4. ~$2,500-3,000 total.
The 4090 generation is the operator default when you specifically need FP8 quantization (some research workloads, TensorRT-LLM with FP8) or new-card warranty for production. For maximum cost-efficiency at the 70B tier, dual-3090 wins. For anything ≤32B, single 5090 wins.
Hardware required
2× RTX 4090 24GB · ATX motherboard with 2× PCIe 4.0/5.0 x16 slots + adequate spacing for triple-slot cards (workstation board recommended: TRX50, W790, X670E Carbon WiFi) · 1500W Platinum PSU · 64-128GB DDR5 · 1TB NVMe · Ubuntu 24.04 LTS · Open-frame or large server chassis with riser cables for slot-2 placement
Components — what to install and why
- 01 · Hardware · rtx-4090: GPUs (2× 24GB new, FP8-capable Ada architecture)
RTX 4090 brings Ada-architecture compute (FP8 transformer engine, faster GDDR6X memory) but NVIDIA removed NVLink from this generation. Two 4090s communicate only over PCIe at ~32 GB/s aggregate vs ~112 GB/s on dual-3090 NVLink. Pick 4090 over 3090 for new-card warranty + FP8 support; pick 3090 for cost-efficiency.
- 02 · Tool · vllm: inference engine (PCIe-aware tensor-parallel-2)
vLLM with --tensor-parallel-size 2. Verify via nvidia-smi that NVLink is NOT engaged — that's expected on 4090. The performance deficit vs an NVLink-equipped pair is ~10-15% on tensor-parallel decode. SGLang is the better pick when prefix-cache hit rate is high (agent loops with stable system prompts).
- 03 · Tool · tensorrt-llm: FP8 throughput leader (when committed to NVIDIA stack)
TensorRT-LLM extracts the FP8 advantage that the Ada architecture supports natively. Recompile-per-config friction is real, but for production deployments where the model + quant are stable, TRT-LLM throughput beats vLLM by 15-25%. Use only when committed to the rebuild discipline.
- 04 · Model · llama-3.3-70b-instruct: primary 70B model
Same 70B-class envelope as dual-3090. AWQ-INT4 fits 48 GB total / ~45 GB effective with 6 GB headroom for KV at 8K context. The L1.25-enriched 70B canonical.
- 05 · Model · qwen-3-30b-a3b: MoE workload (3B active, 30B total)
30B/3B-active MoE on dual-4090 PCIe is the throughput sweet spot. Expert routing across cards is bandwidth-friendlier than tensor parallelism for dense models, so the no-NVLink penalty is smaller. ~80 tok/s decode at 8 concurrent.
- 06 · Model · deepseek-r1-distill-qwen-32b: 32B reasoning model
32B class on dual-4090 leaves substantial headroom — 64K context fits comfortably. R1 distill brings explicit thinking-mode emission for hard reasoning tasks. The L1.25-enriched workstation reasoning canonical.
- 07 · Tool · openwebui: team chat frontend
Same role as dual-3090 build. Open WebUI talks OpenAI-compatible to vLLM; standard pairing for vLLM-backed serving.
Step-by-step setup
Assumes Ubuntu 24.04 LTS. NVIDIA proprietary driver 555+ required for FP8 transformer-engine support.
1. NVIDIA driver + CUDA toolkit
# Add NVIDIA repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560
# Reboot then verify
sudo reboot
nvidia-smi
# Verify both cards detected
nvidia-smi --query-gpu=name,memory.total --format=csv
2. Verify PCIe peer-to-peer (no NVLink)
# Confirm NVLink is OFF (this is correct, not a bug)
nvidia-smi nvlink --status
# Expected output: "GPU 0: Disabled"
# If it says "Active", you have a 3090 not a 4090 — verify your hardware.
# Run peer-to-peer bandwidth check
nvbandwidth -t device_to_device_bidirectional_memcpy_ce
# Expected: 28-32 GB/s aggregate (PCIe 4.0 x8/x8 each direction)
# If under 16 GB/s: PCIe lane allocation is wrong — see motherboard manual
Most consumer X670/B650 boards drop the second slot to x8 when both are populated. For tensor-parallel inference this is acceptable; for pipeline-parallel it costs ~10% throughput.
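As a complementary check, the GPU topology matrix labels the path between the two cards; on a 4090 pair the GPU-to-GPU entry should be a PCIe label (PIX, PXB, PHB, NODE, or SYS depending on the board), never NV#:
# Print the GPU interconnect topology matrix
nvidia-smi topo -m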
3. Install vLLM with AWQ + FP8 support
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm autoawq
# Verify CUDA + GPU detection
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# Expected: True 2
4. Serve Llama 3.3 70B AWQ-INT4 with tensor-parallel-2
# Pre-pull weights
huggingface-cli download casperhansen/llama-3.3-70b-instruct-awq
# Serve with tensor-parallel-2 (PCIe-aware)
vllm serve casperhansen/llama-3.3-70b-instruct-awq \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--quantization awq \
--gpu-memory-utilization 0.92 \
--port 8000
Note: --gpu-memory-utilization 0.92 is the recommended ceiling. 4090s have far less VRAM headroom than an 80 GB H100 SXM, and KV-cache spikes during long-context inference can OOM at higher utilization values.
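A minimal sanity check against the OpenAI-compatible endpoint, assuming the defaults from the command above (served model name matches the repo path, port 8000):
# Send one short chat completion to confirm the server answers
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "casperhansen/llama-3.3-70b-instruct-awq", "messages": [{"role": "user", "content": "Reply with one word."}], "max_tokens": 8}'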
Verifying PCIe peer-to-peer
The nvbandwidth tool (NVIDIA's open-source GPU bandwidth benchmark; build it from the NVIDIA/nvbandwidth GitHub repo if your CUDA install doesn't include it) verifies that GPU 0 and GPU 1 can talk to each other directly. On dual-4090 expect:
- device_to_device_bidirectional: 28-32 GB/s aggregate
- device_to_device_unidirectional: 14-16 GB/s per direction
If unidirectional bandwidth comes in under 12 GB/s, your PCIe lanes have dropped to x4 or you have a chipset bottleneck. Workstation boards (TRX50, W790) deliver x16/x16 sustained; consumer boards (X670E, Z790) typically deliver x8/x8.
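A sketch of that lane check, assuming nvidia-smi reports bus IDs in the usual 00000000:BB:DD.F form (the domain prefix is stripped for lspci):
# Report the negotiated PCIe link width for each GPU
for bus in $(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader); do
  echo "GPU at ${bus}:"
  sudo lspci -vvv -s "${bus#00000000:}" | grep -i 'LnkSta:'
done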
Expected outcome
vLLM tensor-parallel-2 serving Llama 3.3 70B AWQ-INT4 at 28-35 tok/s decode with 8K context. PCIe-only cross-card communication; verify with nvbandwidth that peer-to-peer hits ~32 GB/s aggregate. Sustained power draw under load: 850-950W. FP8 inference unlocks slightly better throughput on supported quants vs Ampere.
Concrete metrics on a properly tuned dual-4090 + vLLM + Llama 3.3 70B AWQ-INT4 setup (a monitoring sketch follows the list):
- Single-stream decode: 28-35 tok/s (PCIe penalty: ~10-15% vs theoretical NVLink-equivalent)
- 4-concurrent decode: 70-100 tok/s aggregate
- TTFT (1K prompt): 200-280 ms
- Power draw: 850-950W steady, 1100W transient peak
- Memory-junction temps: 80-90°C in open-frame chassis with intake fans
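A monitoring sketch for collecting those numbers during a benchmark run; the 5-second sampling interval is arbitrary:
# Log per-GPU power draw, temperature, utilization, and memory use every 5 seconds
nvidia-smi --query-gpu=index,power.draw,temperature.gpu,utilization.gpu,memory.used --format=csv -l 5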
Power, thermal, and acoustic notes
- 1500W PSU minimum. Two 4090s spike to 600W transient each. A 1200W PSU sized for “2×450W=900W” will trigger over-current protection during boost transients. Size for headroom, or cap per-card power as sketched after this list.
- 12VHPWR connector wear. Two cards = two connectors = two failure points. Use cables rated for 600W+ per card; don't reuse the original launch-era 12VHPWR adapters that had melt issues.
- PCIe slot spacing matters. Two triple-slot 4090s require either riser cables or a workstation chassis. Most ATX cases choke the lower card's intake; thermal throttling on slot 2 is the typical result.
- Acoustic envelope. Two 450W cards plus chassis fans is loud — expect 45-55 dBA at 1m under load. Plan for basement / server-room placement for 24/7 deployment.
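If boost transients still trip the PSU, capping per-card board power is a common mitigation. A minimal sketch; the 400W value is an assumption, trade it against your throughput target:
# Cap board power on both GPUs (illustrative value; resets on reboot)
sudo nvidia-smi -i 0 -pl 400
sudo nvidia-smi -i 1 -pl 400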
Failure modes you'll hit
- Tutorial assumes NVLink. Most dual-GPU setup guides assume NVLink. On 4090 there is none. Verify that nvidia-smi nvlink --status reports Disabled (correct) and skip any “configure NVLink” instructions.
- PCIe lane allocation collapse. Consumer boards drop the second slot to x8 when both slots are populated. Workstation boards maintain x16/x16. The x8 drop costs another ~10% throughput vs full x16.
- Power transient over-current shutdowns. Two 4090s under load can spike briefly to 1200W+. PSUs sized exactly for 900W steady will trip OCP. Size for 1500W+.
- FP8 quant pipeline mismatches. Some FP8 quants from HuggingFace require specific TensorRT-LLM versions. Verify the quant's manifest before assuming it loads in vLLM.
- Driver-runtime version mismatch. NVIDIA driver 555+, CUDA 12.6+, and vLLM 0.7+ is the working combination for FP8 + AWQ on Ada. Older driver versions silently fall back to slower kernels and miss the FP8 advantage.
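A quick way to confirm the installed combination before chasing slow kernels:
# Report driver, CUDA toolkit, and vLLM versions
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version | grep release
python -c "import vllm; print(vllm.__version__)"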
Troubleshooting
Symptom: tensor-parallel decode is slower than expected. First confirm PCIe x16/x16 (or at minimum x8/x8) via lspci -vvv | grep -i lnksta — check LnkSta width on both GPU slots. If x4, your motherboard is bottlenecking.
Symptom: vLLM crashes with NCCL timeout. NCCL inter-rank communication over PCIe is more sensitive to system noise than NVLink. Set NCCL_TIMEOUT=600 in the environment for long-context workloads to avoid spurious timeouts; a launch sketch follows.
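A launch sketch with the longer timeout exported; the variable name and value are taken from the tip above, and the serve flags are the same as step 4:
# Export the longer NCCL timeout, then start the server as usual
export NCCL_TIMEOUT=600
vllm serve casperhansen/llama-3.3-70b-instruct-awq --tensor-parallel-size 2 --max-model-len 8192 --quantization awq --gpu-memory-utilization 0.92 --port 8000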
Symptom: random slot-2 thermal throttling. The lower card's intake is choked by the upper card. Raise the chassis intake-fan speed or move slot 2 to a riser cable for unobstructed airflow.
Variations and alternatives
MoE-first variant: serve Qwen 3 30B-A3B for high-concurrency agent serving. The 3B-active MoE shape minimizes cross-card communication (expert routing only), so the no-NVLink penalty is smaller — close to single-card per-stream throughput at 8+ concurrent.
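A serve sketch for this variant, assuming an FP8 build of the model; the Hugging Face repo name and context length are assumptions, so verify the exact quant before pulling:
# MoE-first variant (repo name illustrative; confirm it exists and fits 2× 24 GB)
vllm serve Qwen/Qwen3-30B-A3B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port 8000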
Reasoning-first variant: drop to DeepSeek R1 Distill Qwen 32B. The 32B class fits with 64K context and significant headroom.
SGLang for agent loops: when prefix-cache hit rate is high (agent harnesses with stable system prompts), SGLang's RadixAttention compounds harder than vLLM's continuous batching. Swap the runtime; same model + quant.
Who should avoid this build
- Cost-sensitive builders — dual-3090 NVLink is 50-65% cheaper for a similar VRAM envelope, with faster cross-card communication over NVLink. See Dual RTX 3090 stack.
- Anyone hoping NVIDIA put NVLink on 4090 — they didn't.
- Workstation users who could just buy an H100 80GB used — at $20-30k used, H100 is overkill for individuals but right for orgs.
- Living-room deployments — 950W of GPU + cooling is loud. Basement / server-room build only.
Going deeper
- Dual RTX 4090 combo detail — operator-grade review with effective-VRAM math + failure modes.
- Will-it-run for this combo — see exactly which models fit, are borderline, or aren't practical.
- Multi-GPU buying guide — full decision framework + buyer guidance.
- Dual RTX 3090 stack — cheaper alternative with NVLink.
- Distributed inference systems — architectural depth on tensor / pipeline / expert routing.