Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink
Two RTX 4090s + vLLM tensor-parallel-2 over PCIe (no NVLink). Effective ~45 GB VRAM, ~28-35 tok/s decode on 70B. The newer-architecture alternative to dual-3090 — 2× the cost, ~30% faster, FP8 support, no NVLink.
What this stack accomplishes
Two RTX 4090s in a single chassis run 70B-class local AI with newer-architecture compute, FP8 quantization support, and new-card warranty. The catch every tutorial glosses over: NVIDIA removed NVLink from the 4090 generation. Two 4090s communicate only over PCIe at ~32 GB/s aggregate bandwidth — vs ~112 GB/s NVLink on dual-3090.
The honest comparison:
- Dual 4090 PCIe: ~28-35 tok/s on 70B Q4. New cards. FP8. ~$4,500-5,500 total system cost.
- Dual 3090 NVLink: ~25-30 tok/s on 70B Q4. Used cards. ~$2,000-2,800 total system cost.
- Single 5090: ~50-65 tok/s on 32B Q4. Largest model envelope: 32B Q4. ~$2,500-3,000 total.
The 4090 generation is the operator default when you specifically need FP8 quantization (some research workloads, TensorRT-LLM with FP8) or new-card warranty for production. For maximum cost-efficiency at the 70B tier, dual-3090 wins. For anything ≤32B, single 5090 wins.
Hardware required
2× RTX 4090 24GB · ATX motherboard with 2× PCIe 4.0/5.0 x16 slots + adequate spacing for triple-slot cards (workstation board recommended: TRX50, W790, X670E Carbon WiFi) · 1500W Platinum PSU · 64-128GB DDR5 · 1TB NVMe · Ubuntu 24.04 LTS · Open-frame or large server chassis with riser cables for slot-2 placement
Components — what to install and why
- 01 · Hardware · rtx-4090: GPUs (2× 24GB new, FP8-capable Ada architecture)
RTX 4090 brings Ada-architecture compute (FP8 transformer engine, faster GDDR6X memory) but NVIDIA removed NVLink from this generation. Two 4090s communicate only over PCIe at ~32 GB/s aggregate vs ~112 GB/s on dual-3090 NVLink. Pick 4090 over 3090 for new-card warranty + FP8 support; pick 3090 for cost-efficiency.
- 02 · Tool · vllm: inference engine (PCIe-aware tensor-parallel-2)
vLLM with --tensor-parallel-size 2. Verify via nvidia-smi that NVLink is NOT engaged — that's expected on 4090. The performance deficit vs an NVLink-equipped pair is ~10-15% on tensor-parallel decode. SGLang is the better pick when prefix-cache hit rate is high (agent loops with stable system prompts).
- 03 · Tool · tensorrt-llm: FP8 throughput leader (when committed to NVIDIA stack)
TensorRT-LLM extracts the FP8 advantage that the Ada architecture supports natively. Recompile-per-config friction is real, but for production deployments where the model + quant are stable, TRT-LLM throughput beats vLLM by 15-25%. Use only when committed to the rebuild discipline.
- 04 · Model · llama-3.3-70b-instruct: primary 70B model
Same 70B-class envelope as dual-3090. AWQ-INT4 fits 48 GB total / ~45 GB effective with 6 GB headroom for KV at 8K context. The L1.25-enriched 70B canonical.
- 05 · Model · qwen-3-30b-a3b: MoE workload (3B active, 30B total)
30B/3B-active MoE on dual-4090 PCIe is the throughput sweet spot. Expert routing across cards is bandwidth-friendlier than tensor parallelism for dense models, so the no-NVLink penalty is smaller. ~80 tok/s decode at 8 concurrent.
- 06 · Model · deepseek-r1-distill-qwen-32b: 32B reasoning model
32B class on dual-4090 leaves substantial headroom — 64K context fits comfortably. R1 distill brings explicit thinking-mode emission for hard reasoning tasks. The L1.25-enriched workstation reasoning canonical.
- 07 · Tool · openwebui: team chat frontend
Same role as dual-3090 build. Open WebUI talks OpenAI-compatible to vLLM; standard pairing for vLLM-backed serving.
Step-by-step setup
Assumes Ubuntu 24.04 LTS. NVIDIA proprietary driver 555+ required for FP8 transformer-engine support.
1. NVIDIA driver + CUDA toolkit
# Add NVIDIA repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560
# Reboot then verify
sudo reboot
nvidia-smi
# Verify both cards detected
nvidia-smi --query-gpu=name,memory.total --format=csv
2. Verify PCIe peer-to-peer (no NVLink)
# Confirm NVLink is OFF (this is correct, not a bug)
nvidia-smi nvlink --status
# Expected output: "GPU 0: Disabled"
# If it says "Active", you have a 3090 not a 4090 — verify your hardware.
# Run peer-to-peer bandwidth check
nvbandwidth -t device_to_device_bidirectional_memcpy_ce
# Expected: 28-32 GB/s aggregate (PCIe 4.0 x8/x8 each direction)
# If under 16 GB/s: PCIe lane allocation is wrong — see motherboard manual
Most consumer X670/B650 boards drop the second slot to x8 when both are populated. For tensor-parallel inference this is acceptable; for pipeline-parallel it costs ~10% throughput.
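As a complementary check, the GPU topology matrix labels the path between the two cards; on a 4090 pair the GPU-to-GPU entry should be a PCIe label (PIX, PXB, PHB, NODE, or SYS depending on the board), never NV#:
# Print the GPU interconnect topology matrix
nvidia-smi topo -m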
3. Install vLLM with AWQ + FP8 support
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm autoawq
# Verify CUDA + GPU detection
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# Expected: True 2
4. Serve Llama 3.3 70B AWQ-INT4 with tensor-parallel-2
# Pre-pull weights
huggingface-cli download casperhansen/llama-3.3-70b-instruct-awq
# Serve with tensor-parallel-2 (PCIe-aware)
vllm serve casperhansen/llama-3.3-70b-instruct-awq \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--quantization awq \
--gpu-memory-utilization 0.92 \
--port 8000
Note: --gpu-memory-utilization 0.92 is the recommended ceiling. 4090s have far less VRAM headroom than an 80 GB H100 SXM, and KV-cache spikes during long-context inference can OOM at higher utilization values.
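A minimal sanity check against the OpenAI-compatible endpoint, assuming the defaults from the command above (served model name matches the repo path, port 8000):
# Send one short chat completion to confirm the server answers
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "casperhansen/llama-3.3-70b-instruct-awq", "messages": [{"role": "user", "content": "Reply with one word."}], "max_tokens": 8}'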
Verifying PCIe peer-to-peer
The nvbandwidth tool (NVIDIA's open-source GPU bandwidth benchmark; build it from the NVIDIA/nvbandwidth GitHub repo if your CUDA install doesn't include it) verifies that GPU 0 and GPU 1 can talk to each other directly. On dual-4090 expect:
- device_to_device_bidirectional: 28-32 GB/s aggregate
- device_to_device_unidirectional: 14-16 GB/s per direction
If unidirectional bandwidth comes in under 12 GB/s, your PCIe lanes have dropped to x4 or you have a chipset bottleneck. Workstation boards (TRX50, W790) deliver x16/x16 sustained; consumer boards (X670E, Z790) typically deliver x8/x8.
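A sketch of that lane check, assuming nvidia-smi reports bus IDs in the usual 00000000:BB:DD.F form (the domain prefix is stripped for lspci):
# Report the negotiated PCIe link width for each GPU
for bus in $(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader); do
  echo "GPU at ${bus}:"
  sudo lspci -vvv -s "${bus#00000000:}" | grep -i 'LnkSta:'
done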
Expected outcome
vLLM tensor-parallel-2 serving Llama 3.3 70B AWQ-INT4 at 28-35 tok/s decode with 8K context. PCIe-only cross-card communication; verify with nvbandwidth that peer-to-peer hits ~32 GB/s aggregate. Sustained power draw under load: 850-950W. FP8 inference unlocks slightly better throughput on supported quants vs Ampere.
Concrete metrics on a properly tuned dual-4090 + vLLM + Llama 3.3 70B AWQ-INT4 setup (a monitoring sketch follows the list):
- Single-stream decode: 28-35 tok/s (PCIe penalty: ~10-15% vs theoretical NVLink-equivalent)
- 4-concurrent decode: 70-100 tok/s aggregate
- TTFT (1K prompt): 200-280 ms
- Power draw: 850-950W steady, 1100W transient peak
- Memory-junction temps: 80-90°C in open-frame chassis with intake fans
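A monitoring sketch for collecting those numbers during a benchmark run; the 5-second sampling interval is arbitrary:
# Log per-GPU power draw, temperature, utilization, and memory use every 5 seconds
nvidia-smi --query-gpu=index,power.draw,temperature.gpu,utilization.gpu,memory.used --format=csv -l 5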
Power, thermal, and acoustic notes
- 1500W PSU minimum. Two 4090s spike to 600W transient each. A 1200W PSU sized for “2×450W=900W” will trigger over-current protection during boost transients. Size for headroom, or cap per-card power as sketched after this list.
- 12VHPWR connector wear. Two cards = two connectors = two failure points. Use cables rated for 600W+ per card; don't reuse the original launch-era 12VHPWR adapters that had melt issues.
- PCIe slot spacing matters. Two triple-slot 4090s require either riser cables or a workstation chassis. Most ATX cases choke the lower card's intake; thermal throttling on slot 2 is the typical result.
- Acoustic envelope. Two 450W cards plus chassis fans is loud — expect 45-55 dBA at 1m under load. Plan for basement / server-room placement for 24/7 deployment.
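If boost transients still trip the PSU, capping per-card board power is a common mitigation. A minimal sketch; the 400W value is an assumption, trade it against your throughput target:
# Cap board power on both GPUs (illustrative value; resets on reboot)
sudo nvidia-smi -i 0 -pl 400
sudo nvidia-smi -i 1 -pl 400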
Failure modes you'll hit
- Tutorial assumes NVLink. Most dual-GPU setup guides assume NVLink. On 4090 there is none. Verify that nvidia-smi nvlink --status reports Disabled (correct) and skip any “configure NVLink” instructions.
- PCIe lane allocation collapse. Consumer boards drop the second slot to x8 when both slots are populated. Workstation boards maintain x16/x16. The x8 drop costs another ~10% throughput vs full x16.
- Power transient over-current shutdowns. Two 4090s under load can spike briefly to 1200W+. PSUs sized exactly for 900W steady will trip OCP. Size for 1500W+.
- FP8 quant pipeline mismatches. Some FP8 quants from HuggingFace require specific TensorRT-LLM versions. Verify the quant's manifest before assuming it loads in vLLM.
- Driver-runtime version mismatch. NVIDIA driver 555+, CUDA 12.6+, and vLLM 0.7+ is the working combination for FP8 + AWQ on Ada. Older driver versions silently fall back to slower kernels and miss the FP8 advantage.
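A quick way to confirm the installed combination before chasing slow kernels:
# Report driver, CUDA toolkit, and vLLM versions
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version | grep release
python -c "import vllm; print(vllm.__version__)"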
Troubleshooting
Symptom: tensor-parallel decode is slower than expected. First confirm PCIe x16/x16 (or at minimum x8/x8) via lspci -vvv | grep -i lnksta — check LnkSta width on both GPU slots. If x4, your motherboard is bottlenecking.
Symptom: vLLM crashes with NCCL timeout. NCCL inter-rank communication over PCIe is more sensitive to system noise than NVLink. Set NCCL_TIMEOUT=600 in the environment for long-context workloads to avoid spurious timeouts; a launch sketch follows.
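A launch sketch with the longer timeout exported; the variable name and value are taken from the tip above, and the serve flags are the same as step 4:
# Export the longer NCCL timeout, then start the server as usual
export NCCL_TIMEOUT=600
vllm serve casperhansen/llama-3.3-70b-instruct-awq --tensor-parallel-size 2 --max-model-len 8192 --quantization awq --gpu-memory-utilization 0.92 --port 8000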
Symptom: random slot-2 thermal throttling. The lower card's intake is choked by the upper card. Raise the chassis intake-fan speed or move slot 2 to a riser cable for unobstructed airflow.
Variations and alternatives
MoE-first variant: serve Qwen 3 30B-A3B for high-concurrency agent serving. The 3B-active MoE shape minimizes cross-card communication (expert routing only), so the no-NVLink penalty is smaller — close to single-card per-stream throughput at 8+ concurrent.
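A serve sketch for this variant, assuming an FP8 build of the model; the Hugging Face repo name and context length are assumptions, so verify the exact quant before pulling:
# MoE-first variant (repo name illustrative; confirm it exists and fits 2× 24 GB)
vllm serve Qwen/Qwen3-30B-A3B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port 8000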
Reasoning-first variant: drop to DeepSeek R1 Distill Qwen 32B. The 32B class fits with 64K context and significant headroom.
SGLang for agent loops: when prefix-cache hit rate is high (agent harnesses with stable system prompts), SGLang's RadixAttention compounds harder than vLLM's continuous batching. Swap the runtime; same model + quant.
Who should avoid this build
- Cost-sensitive builders — dual-3090 NVLink is 50-65% cheaper for a similar VRAM envelope, with faster cross-card communication over NVLink. See Dual RTX 3090 stack.
- Anyone hoping NVIDIA put NVLink on 4090 — they didn't.
- Workstation users who could just buy an H100 80GB used — at $20-30k used, H100 is overkill for individuals but right for orgs.
- Living-room deployments — 950W of GPU + cooling is loud. Basement / server-room build only.
Going deeper
- Dual RTX 4090 combo detail — operator-grade review with effective-VRAM math + failure modes.
- Will-it-run for this combo — see exactly which models fit, are borderline, or aren't practical.
- Multi-GPU buying guide — full decision framework + buyer guidance.
- Dual RTX 3090 stack — cheaper alternative with NVLink.
- Distributed inference systems — architectural depth on tensor / pipeline / expert routing.