
Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs

Two used RTX 3090s with NVLink + vLLM tensor-parallel-2 = 70B-class local AI at the lowest viable price in 2026. 48 GB pooled VRAM, ~25-30 tok/s decode, Apache 2.0 model coverage. The prosumer multi-GPU baseline.

By Fredoline Eruo · Last reviewed 2026-05-06

What this stack accomplishes

Two used RTX 3090s joined by an NVLink 3-slot bridge are the cheapest path to 70B-class local AI in 2026. Total system cost lands between $2,000-2,800 (used 3090s + bridge + new motherboard / PSU / case) — roughly half the price of a dual-RTX-4090 build with a comparable VRAM envelope, and meaningfully cheaper than a used H100 80GB.

The honest framing of the tradeoff:

  • Dual 3090 NVLink extracts close-to-theoretical tensor-parallel throughput: ~25-30 tok/s on Llama 3.3 70B AWQ-INT4.
  • Single 5090 gives ~50-65 tok/s on 32B-class but caps the largest model at ~32B at Q4.
  • Dual 4090 PCIe (no NVLink) hits ~28-35 tok/s on 70B but costs 2-3× more.

If your workload requires 70B+ models, dual 3090 is the operator default. If your workload caps at 32B, skip multi-GPU entirely and buy a single 5090 — see the multi-GPU buying guide for the full decision framework.

Hardware required

2× RTX 3090 24GB (used) · NVLink 3-slot bridge · ATX motherboard with 2× PCIe 4.0 x16 slots (e.g. ASRock Rack ROMED8-2T or MSI X670E Carbon) · 1000W Platinum PSU · 64GB DDR5 · 1TB NVMe · Ubuntu 22.04 LTS or 24.04 LTS · 3-slot or open-frame chassis with intake + inter-card cooling

Components — what to install and why

The stack
  1. Hardware · GPUs (2× 24GB used, the cheapest path to 48 GB total)
    rtx-3090

    Used 3090 prices in 2026 sit around $600-900 per card — two for less than half the price of a single new 5090. The NVLink 3.0 bridge between them is the only consumer-tier setup that genuinely extracts tensor-parallel performance, because neither the 4090 nor the 5090 supports NVLink.

  2. Tool · Inference engine (production tensor-parallel-2)
    vllm

    vLLM with --tensor-parallel-size 2 is the canonical configuration. AWQ-INT4 fits 70B in the 48 GB envelope with ~6 GB headroom for KV cache at 8K context (see the sizing sketch after this list). Continuous batching extracts throughput at 4-8 concurrent agent loops; at this scale SGLang's prefix-cache wins are also significant — pick by workload.

  3. Tool · Alternative high-throughput runtime
    exllamav2

    ExLlamaV2 with EXL2 quants is the single-stream decode throughput leader on dual-3090 NVLink. Slightly faster than vLLM AWQ-INT4, at the cost of a less mature serving stack. Use it when peak per-stream tok/s matters more than concurrent serving.

  4. Model · Primary 70B chat / instruction model
    llama-3.3-70b-instruct

    Llama 3.3 70B at AWQ-INT4 (~40 GB weights) fits dual-3090 with a comfortable ~6 GB of headroom. The 70B-class default; superseded only by Qwen 2.5 72B when Apache 2.0 licensing matters.

  5. Model · 70B reasoning model with explicit thinking-mode
    deepseek-r1-distill-llama-70b

    Same hardware envelope as Llama 3.3 70B. R1 distill produces 5-15× more tokens per query (reasoning bloat), so per-stream throughput drops to 8-15 tok/s effective. Reserve for hard reasoning workloads; use Llama 3.3 for general chat.

  6. Model · 32B coding model that fits with concurrency room
    qwen-2.5-coder-32b-instruct

    32B class on dual-3090 leaves significant headroom — fits 8K context AND serves 8-16 concurrent coding-agent loops via vLLM continuous batching. The right pick when the workload is coding-tier rather than chat-tier.

  7. Tool · Team chat frontend
    openwebui

    Open WebUI handles multi-user chat against the vLLM endpoint. Container-deployed; the standard frontend pairing for vLLM-backed serving. AnythingLLM is the alternative when document-grounding is the primary use case.

Step-by-step setup

Assumes Ubuntu 24.04 LTS. NVIDIA proprietary driver 550+ required for NVLink 3.0 support.

1. NVIDIA driver + CUDA toolkit

# Add NVIDIA repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560

# Reboot then verify
sudo reboot
nvidia-smi
nvidia-smi nvlink --status

The nvidia-smi nvlink --status output should show Active on both GPUs. If it shows Disabled or OFF, the bridge is misseated — power off, reseat, retry. This is the single most common operator mistake.

2. Install vLLM with AWQ support

# Use venv to avoid system-Python pollution
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm autoawq

# Verify CUDA + GPU detection
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# Expected: True 2

3. Pull and serve Llama 3.3 70B AWQ-INT4

# Pre-pull weights (substitute your HuggingFace cache path)
huggingface-cli download casperhansen/llama-3.3-70b-instruct-awq

# Serve with tensor-parallel-2
vllm serve casperhansen/llama-3.3-70b-instruct-awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --quantization awq \
  --gpu-memory-utilization 0.92 \
  --port 8000

Set --gpu-memory-utilization 0.92 to reserve ~2 GB per card for system processes and KV-cache headroom. Setting it to 1.0 will OOM under load when inter-card communication spikes.

4. Stand up Open WebUI as the team frontend

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=sk-no-key-required \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open WebUI auto-discovers the vLLM endpoint as an OpenAI-compatible model. Browse to http://localhost:3000 and the Llama 3.3 70B model will appear in the chat selector.

Verifying NVLink

Run a quick benchmark to confirm NVLink is delivering its bandwidth advantage:

# nvbandwidth is NVIDIA's standalone bandwidth tester (github.com/NVIDIA/nvbandwidth), built from source
nvbandwidth -t device_to_device_bidirectional_memcpy_ce

# Expected: peer-to-peer bandwidth between GPU 0 and GPU 1
# should report ~100 GB/s bidirectional (~56 GB/s per direction on NVLink 3.0).
# If it reports PCIe-class numbers (~16-25 GB/s), NVLink is NOT engaged — bridge issue.

Expected outcome

vLLM tensor-parallel-2 serving Llama 3.3 70B AWQ-INT4 at 25-30 tok/s decode with 8K context. nvidia-smi nvlink --status reports Active on both GPUs. Sustained power draw under load: 680-720W. Memory-junction temps stay below 100°C with active inter-card cooling.

Concrete metrics on a properly-tuned dual-3090 + vLLM + Llama 3.3 70B AWQ-INT4 setup (a quick reproduction sketch follows the list):

  • Single-stream decode: 25-30 tok/s
  • 4-concurrent decode: 60-90 tok/s aggregate
  • TTFT (time-to-first-token, 1K prompt): 250-350 ms
  • Memory-junction temp: 85-95°C with intake + inter-card fans
  • Power draw: 680-720W steady, 850W transient peak

Power, thermal, and acoustic notes

Two 350W cards in a single chassis is at the upper edge of what consumer cooling tolerates. Plan for:

  • 1000W Platinum PSU minimum. Sustained 700W draw keeps the PSU at a comfortable ~70% of rated capacity; transient spikes can hit 850W.
  • Open-frame or server chassis preferred. Standard ATX towers choke the lower card's intake; thermal throttling on GPU 1 is the typical result.
  • Inter-card fan mandatory. Memory-junction temps on the inner card routinely hit 105°C without active cooling — invisible in standard nvidia-smi output (use --query-gpu=temperature.memory, as in the one-liner after this list).
  • Acoustic envelope. Two 350W cards under load are loud. Plan basement / server-room placement for 24/7 deployment.

Failure modes you'll hit

  1. NVLink bridge looks seated but isn't. Most common operator mistake. nvidia-smi nvlink --status will show Disabled or OFF. Power off, reseat the bridge until it clicks fully, retry.
  2. PCIe lane starvation. Most consumer chipsets drop the second slot to x8 when both are populated. For tensor parallelism this is acceptable; for pipeline parallelism it costs ~15-25% throughput. Workstation boards (TRX50, W790) maintain x16/x16 but cost $1,000+.
  3. Memory-junction overheating on the inner card. The card sandwiched between others heats faster. Monitor with nvidia-smi --query-gpu=temperature.memory --format=csv -l 5 during long-context inference. If junction temps cross 105°C, throttling kicks in silently.
  4. Thermal-paste degradation on used cards. 5-year-old 3090s benefit from a thermal-paste refresh, dropping core temps 8-15°C. Plan for re-paste before committing to production workloads.
  5. Driver / vLLM version mismatch. NVIDIA driver 550+, CUDA 12.6, and vLLM 0.7+ is the known-working combination for AWQ-INT4 on Ampere; older driver versions silently fall back to slower kernels. A quick version check follows this list.

Troubleshooting

Symptom: tensor-parallel decode is slower than single-card. Check nvidia-smi nvlink --status first. If Active, verify NCCL is actually using NVLink by launching vLLM with NCCL_DEBUG=INFO in the environment — the log should show NCCL choosing NVLink (P2P) for inter-rank communication. If it falls back to PCIe, your CUDA driver is too old.

Symptom: random OOM on long-context queries. KV cache for 70B at 8K context consumes ~6 GB across both cards. If --gpu-memory-utilization is set above 0.95, you're leaving no headroom — drop to 0.92 and retry.

Symptom: model loads on GPU 0 only. Verify both cards are visible to the vLLM process: CUDA_VISIBLE_DEVICES=0,1 vllm serve .... NUMA placement on dual-socket boards can also hide GPUs from one socket's processes.

Variations and alternatives

Reasoning-first variant: replace Llama 3.3 70B with DeepSeek R1 Distill Llama 70B. Same hardware envelope, explicit thinking-mode emission. Per-stream throughput drops to 8-15 tok/s effective due to reasoning-token bloat.

Coding-first variant: drop to Qwen 2.5 Coder 32B for 8-16 concurrent coding-agent loops via vLLM continuous batching. The 32B class fits with significant headroom on dual-3090, freeing VRAM for larger KV caches.

Multilingual variant: swap to Qwen 2.5 72B — Apache 2.0, stronger non-English coverage, similar throughput envelope.

Who should avoid this build

  • First-time local-AI builders — start with single 4090 or 5090. Multi-GPU debugging is real work.
  • Anyone needing production reliability — used 3090s have unknown service histories. Datacenter hardware (H100 / A100) is the right answer when uptime matters.
  • Living-room deployments — 700W of GPU + cooling is loud and hot. This is a basement / server-room build.
  • Workloads that fit in 32 GB — single 5090 wins on simplicity, power, and warranty for everything 32B and below.

Going deeper