Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs
Two used RTX 3090s with NVLink + vLLM tensor-parallel-2 = 70B-class local AI at the lowest viable price in 2026. Effective 46 GB VRAM, ~25-30 tok/s decode, Apache 2.0 model coverage. The prosumer multi-GPU baseline.
What this stack accomplishes
Two used RTX 3090s with an NVLink 3-slot bridge is the cheapest path to 70B-class local AI in 2026. Total system cost lands between $2,000-2,800 (used 3090s + bridge + new motherboard / PSU / case) — roughly half the price of a dual-RTX-4090 build with comparable VRAM envelope, and meaningfully cheaper than a single H100 80GB used.
The honest framing of the tradeoff:
- Dual 3090 NVLink extracts close-to-theoretical tensor-parallel throughput: ~25-30 tok/s on Llama 3.3 70B AWQ-INT4.
- Single 5090 gives ~50-65 tok/s on 32B-class but caps the largest model at ~32B at Q4.
- Dual 4090 PCIe (no NVLink) hits ~28-35 tok/s on 70B but costs 2-3× more.
If your workload requires 70B+ models, dual 3090 is the operator default. If your workload caps at 32B, skip multi-GPU entirely and buy a single 5090 — see the multi-GPU buying guide for the full decision framework.
Hardware required
2× RTX 3090 24GB (used) · NVLink 3-slot bridge · ATX motherboard with 2× PCIe 4.0 x16 slots (e.g. ASRock Rack ROMED8-2T or MSI X670E Carbon) · 1000W Platinum PSU · 64GB DDR5 · 1TB NVMe · Ubuntu 22.04 LTS or 24.04 LTS · 3-slot or open-frame chassis with intake + inter-card cooling
Components — what to install and why
- 01 · Hardware · rtx-3090: GPUs (2× 24GB used, the cheapest path to 48 GB total)
Used 3090 prices in 2026 sit around $600-900 per card — two for less than half a single new 5090. The NVLink bridge between them is the only consumer-tier setup that genuinely extracts tensor-parallel scaling: the 4090 has no NVLink, and neither does the 5090.
- 02 · Tool · vllm: Inference engine (production tensor-parallel-2)
vLLM with --tensor-parallel-size 2 is the canonical configuration. AWQ-INT4 fits 70B in the 48 GB envelope with ~6 GB of headroom for KV cache at 8K context. Continuous batching sustains throughput across 4-8 concurrent agent loops; at this scale SGLang's prefix-cache wins are also significant — pick by workload.
- 03 · Tool · exllamav2: Alternative high-throughput runtime
ExLlamaV2 with EXL2 quants is the throughput leader on dual-3090 NVLink for single-stream decode. Slightly faster per stream than vLLM AWQ-INT4, at the cost of a less-mature serving stack. Use it when peak per-stream tok/s matters more than concurrent serving.
- 04 · Model · llama-3.3-70b-instruct: Primary 70B chat / instruction model
Llama 3.3 70B at AWQ-INT4 (~40 GB weights) fits dual-3090 with a comfortable ~6 GB of headroom. The 70B-class default; displaced only by Qwen 2.5 72B when Apache licensing is a requirement.
- 05 · Model · deepseek-r1-distill-llama-70b: 70B reasoning model with explicit thinking-mode
Same hardware envelope as Llama 3.3 70B. R1 distill produces 5-15× more tokens per query (reasoning bloat), so per-stream throughput drops to 8-15 tok/s effective. Reserve for hard reasoning workloads; use Llama 3.3 for general chat.
- 06 · Model · qwen-2.5-coder-32b-instruct: 32B coding model that fits with concurrency room
32B class on dual-3090 leaves significant headroom — fits 8K context AND serves 8-16 concurrent coding-agent loops via vLLM continuous batching. The right pick when the workload is coding-tier rather than chat-tier.
- 07 · Tool · openwebui: Team chat frontend
Open WebUI handles multi-user chat against the vLLM endpoint. Container-deployed; the standard frontend pairing for vLLM-backed serving. AnythingLLM is the alternative when document-grounding is the primary use case.
Step-by-step setup
Assumes Ubuntu 24.04 LTS. NVIDIA proprietary driver 550+ required for NVLink 3.0 support.
1. NVIDIA driver + CUDA toolkit
# Add NVIDIA repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560
# Reboot then verify
sudo reboot
nvidia-smi
nvidia-smi nvlink --status
The nvidia-smi nvlink --status output should show Active on both GPUs. If it shows Disabled or OFF, the bridge is misseated — power off, reseat, retry. This is the single most common operator mistake.
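A complementary check is the topology matrix. A minimal sketch; the exact cell labels vary by driver version, but an NV-prefixed entry between the two GPUs indicates the bridge is recognized, while PHB/PIX-style entries mean traffic is routed over PCIe.
# Print the GPU interconnect topology matrix
# An NV# entry between GPU0 and GPU1 means NVLink is recognized;
# PHB / PIX / NODE entries mean the cards only reach each other over PCIe
nvidia-smi topo -m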
2. Install vLLM with AWQ support
# Use venv to avoid system-Python pollution
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm autoawq
# Verify CUDA + GPU detection
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# Expected: True 2
3. Pull and serve Llama 3.3 70B AWQ-INT4
# Pre-pull weights (substitute your HuggingFace cache path)
huggingface-cli download casperhansen/llama-3.3-70b-instruct-awq
# Serve with tensor-parallel-2
vllm serve casperhansen/llama-3.3-70b-instruct-awq \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--quantization awq \
--gpu-memory-utilization 0.92 \
--port 8000
Set --gpu-memory-utilization 0.92 to reserve ~2 GB per card for system processes and KV-cache headroom. Setting it to 1.0 will OOM under load when inter-card communication spikes.
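Before wiring up a frontend, confirm the endpoint end to end. A minimal sketch, assuming the server is listening on the default port 8000 and serving the model name used above:
# List the models the server exposes (OpenAI-compatible API)
curl http://localhost:8000/v1/models
# Fire one short completion to confirm decode runs across both GPUs
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "casperhansen/llama-3.3-70b-instruct-awq",
       "messages": [{"role": "user", "content": "Say hello in five words."}],
       "max_tokens": 32}'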
4. Stand up Open WebUI as the team frontend
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=sk-no-key-required \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open WebUI auto-discovers the vLLM endpoint as an OpenAI-compatible backend. Browse to http://localhost:3000 and the Llama 3.3 70B model will appear in the chat selector.
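If the model selector stays empty, check that the container is running and read its logs; connection errors against the host-gateway URL show up there.
# Confirm the container is up, then tail its logs for backend connection errors
docker ps --filter name=open-webui
docker logs --tail 50 open-webui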
Verifying NVLink
Run a quick benchmark to confirm NVLink is delivering its bandwidth advantage:
# nvbandwidth is NVIDIA's standalone bandwidth benchmark (github.com/NVIDIA/nvbandwidth); build it from source
nvbandwidth -t device_to_device_bidirectional_memcpy_ce
# Expected: peer-to-peer bandwidth between GPU 0 and GPU 1
# should report ~100+ GB/s per direction.
# If it reports ~16 GB/s, NVLink is NOT engaged — bridge issue.
Expected outcome
vLLM tensor-parallel-2 serving Llama 3.3 70B AWQ-INT4 at 25-30 tok/s decode with 8K context. nvidia-smi nvlink --status reports active links on both GPUs (OFF means the bridge is misseated). Sustained power draw under load: 680-720W. Memory-junction temps stay below 100°C with active inter-card cooling.
Concrete metrics on a properly-tuned dual-3090 + vLLM + Llama 3.3 70B AWQ-INT4 setup:
- Single-stream decode: 25-30 tok/s
- 4-concurrent decode: 60-90 tok/s aggregate
- TTFT (time-to-first-token, 1K prompt): 250-350 ms
- Memory-junction temp: 85-95°C with intake + inter-card fans
- Power draw: 680-720W steady, 850W transient peak
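The single-stream number is easy to sanity-check without a benchmark harness. A rough sketch, assuming jq and bc are installed and the vLLM server from step 3 is running; it includes TTFT in the wall-clock time, so treat the result as a lower bound:
# Time one 256-token completion and divide generated tokens by wall-clock seconds
START=$(date +%s.%N)
TOKENS=$(curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "casperhansen/llama-3.3-70b-instruct-awq",
       "messages": [{"role": "user", "content": "Explain NVLink in about 300 words."}],
       "max_tokens": 256}' | jq '.usage.completion_tokens')
END=$(date +%s.%N)
echo "$(echo "scale=1; $TOKENS / ($END - $START)" | bc) tok/s over $TOKENS tokens"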
Power, thermal, and acoustic notes
Two 350W cards in a single chassis is at the upper edge of what consumer cooling tolerates. Plan for:
- 1000W Platinum PSU minimum. Sustained 700W draw at 80% utilization is acceptable; transient spikes can hit 850W.
- Open-frame or server chassis preferred. Standard ATX towers choke the lower card's intake; thermal throttling on GPU 1 is the typical result.
- Inter-card fan mandatory. Memory-junction temps on the inner card routinely hit 105°C without active cooling — invisible in standard nvidia-smi output (use --query-gpu=temperature.memory); a monitoring one-liner is sketched after this list.
- Acoustic envelope. Two 350W cards under load are loud. Plan basement / server-room placement for 24/7 deployment.
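The monitoring one-liner referenced above is a single nvidia-smi query loop. Note that temperature.memory may read [N/A] on some driver and card combinations; if so, a third-party sensor tool is needed.
# Poll both cards every 5 s: core temp, memory-junction temp, power draw
nvidia-smi --query-gpu=index,name,temperature.gpu,temperature.memory,power.draw \
  --format=csv -l 5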
Failure modes you'll hit
- NVLink bridge looks seated but isn't. Most common operator mistake. nvidia-smi nvlink --status will show Disabled or OFF. Power off, reseat the bridge until it clicks fully, retry.
- PCIe lane starvation. Most consumer chipsets drop the second slot to x8 when both are populated. For tensor parallelism this is acceptable; for pipeline parallelism it costs ~15-25% throughput. Workstation boards (TRX50, W790) maintain x16/x16 but cost $1,000+.
- Memory-junction overheating on the inner card. The card sandwiched between others heats faster. Monitor with nvidia-smi --query-gpu=temperature.memory --format=csv -l 5 during long-context inference. If junction temps cross 105°C, throttling kicks in silently.
- Thermal-paste degradation on used cards. 5-year-old 3090s benefit from a thermal-paste refresh, dropping core temps 8-15°C. Plan for a re-paste before committing to production workloads.
- Driver / vLLM version mismatch. NVIDIA driver 550 or newer, CUDA 12.6, and vLLM 0.7+ is the working combination for AWQ-INT4 on Ampere. Older driver versions silently fall back to slower kernels.
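For the last failure mode, a quick way to confirm the driver / CUDA / vLLM combination before digging into kernel-level debugging:
# Report driver, CUDA toolkit, and vLLM versions in one pass
# (run the last line inside the ~/venvs/vllm virtualenv from step 2)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version | grep release
python -c "import torch, vllm; print('vllm', vllm.__version__, '| torch CUDA', torch.version.cuda)"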
Troubleshooting
Symptom: tensor-parallel decode is slower than single-card. Check nvidia-smi nvlink --status first. If it reports Active, verify vLLM is actually using NVLink by setting NCCL_DEBUG=INFO in the environment — the log should show NCCL choosing NVLink for inter-rank communication. If it's falling back to PCIe, your CUDA driver is too old.
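A concrete way to run that check, assuming the launch command from step 3: set NCCL_DEBUG=INFO and read the initialization lines, which name the transport chosen for each inter-rank channel.
# Relaunch with NCCL debug logging and filter for the NCCL init lines
NCCL_DEBUG=INFO vllm serve casperhansen/llama-3.3-70b-instruct-awq \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000 2>&1 | grep -i nccl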
Symptom: random OOM on long-context queries. KV cache for 70B at 8K context consumes ~6 GB across both cards. If --gpu-memory-utilization is set above 0.95, you're leaving no headroom — drop to 0.92 and retry.
Symptom: model loads on GPU 0 only. Verify both cards are visible to the vLLM process: CUDA_VISIBLE_DEVICES=0,1 vllm serve .... NUMA placement on dual-socket boards can also hide GPUs from one socket's processes.
Variations and alternatives
Reasoning-first variant: replace Llama 3.3 70B with DeepSeek R1 Distill Llama 70B. Same hardware envelope, explicit thinking-mode emission. Per-stream throughput drops to 8-15 tok/s effective due to reasoning-token bloat.
Coding-first variant: drop to Qwen 2.5 Coder 32B for 8-16 concurrent coding-agent loops via vLLM continuous batching. The 32B class fits with significant headroom on dual-3090, freeing VRAM for larger KV caches.
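A minimal way to exercise continuous batching once the coder model is up; the model name below is the upstream Hugging Face ID and is an assumption, so substitute whichever quantized variant you actually serve:
# Fire 8 concurrent completions at the endpoint and wait for all of them
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct",
         "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
         "max_tokens": 128}' > /dev/null &
done
wait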
Multilingual variant: swap to Qwen 2.5 72B — Apache 2.0, stronger non-English coverage, similar throughput envelope.
Who should avoid this build
- First-time local-AI builders — start with single 4090 or 5090. Multi-GPU debugging is real work.
- Anyone needing production reliability — used 3090s have unknown service histories. Datacenter hardware (H100 / A100) is the right answer when uptime matters.
- Living-room deployments — 700W of GPU + cooling is loud and hot. This is a basement / server-room build.
- Workloads that fit in 32 GB — single 5090 wins on simplicity, power, and warranty for everything 32B and below.
Going deeper
- Dual RTX 3090 combo detail — operator-grade review with effective-VRAM math + failure modes.
- Will-it-run for this combo — see exactly which models fit, are borderline, or aren't practical.
- Multi-GPU buying guide — the full decision framework for multi-GPU local AI.
- Distributed inference systems — when even dual-3090 isn't enough.
- Dual RTX 4090 workstation — newer-architecture sibling, 2× the cost, no NVLink.