Stack · L3 execution · Homelab tier · transitional

Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path

RTX 4090 + RTX 3090 in the same chassis. PCIe-only (no NVLink). 48 GB total / ~42 GB effective via llama.cpp layer-split. The transitional configuration when you already own one card and want to add the other — operationally honest, not a target build.

By Fredoline Eruo · Last reviewed 2026-05-07

What this stack accomplishes (and what it doesn't)

What this stack accomplishes:

  • 70B-class models run via llama.cpp layer-split across 48 GB total
  • Workable single-user inference for hobby / experimentation
  • Bridge between owning one card and committing to a symmetric pair

What it does NOT accomplish:

  • Tensor parallelism — vLLM/SGLang assume symmetric cards; mixed pairs underperform by 2-3×
  • Maximum throughput — the 4090 wastes ~30% of its compute waiting for the 3090
  • Production-grade reliability — too many asymmetric failure surfaces

Hardware required

1× RTX 4090 24GB · 1× RTX 3090 24GB · ATX motherboard with 2× PCIe 4.0 x16 slots · 1300W Platinum PSU · 64-128GB DDR5 · 1TB NVMe · Ubuntu 24.04 LTS · Open-frame chassis OR ATX tower with riser cable for slot 2

Components — what to install and why

The stack
  1. Hardware · Primary GPU (faster, takes more layers)
    rtx-4090

    RTX 4090 is the throughput leader; in layer-split mode it takes ~55% of the layers. Its FP8 capability is wasted in this config, since llama.cpp doesn't exploit FP8 the way TensorRT-LLM does.

  2. Hardware · Secondary GPU (slower, takes fewer layers)
    rtx-3090

    RTX 3090 holds ~45% of layers in layer-split. Its older Ampere architecture and lower memory bandwidth (936 vs 1008 GB/s) make it the pacing card.

  3. Tool · Inference engine (asymmetric layer-split)
    llama-cpp

    llama.cpp is the only practical runtime for asymmetric pairs. Its --tensor-split argument accepts unequal ratios; vLLM and SGLang assume symmetric cards and underperform by 2-3× on mixed setups.

  4. Tool · Alternative for asymmetric layer-split
    exllamav2

    ExLlamaV2's EXL2 quants likewise support uneven splits across mismatched cards, specified as VRAM per GPU rather than a ratio. EXL2 is sharper than GGUF at equivalent size; pick it when peak per-stream throughput on the 4090 matters. (See the first sketch after this list.)

  5. Model · Primary 70B model (Q4_K_M)
    llama-3.3-70b-instruct

    70B Q4_K_M (~40 GB weights) fits the 48 GB total via layer-split. Real per-stream throughput trails dual-3090 NVLink by ~40-50% due to asymmetric stalling.

  6. Model · Coding model (better fit for this combo)
    qwen-2.5-coder-32b-instruct

    A 32B-class model fits with substantial headroom on the asymmetric pair and amplifies the mismatch less than a 70B: less absolute per-token time lands on the slower card, and at Q4_K_M (~19-20 GB of weights) it can even run entirely on the 4090, skipping the cross-card handoff. (See the second sketch after this list.)
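
Two sketches referenced above. First, a hedged ExLlamaV2 invocation for item 04, assuming an EXL2 quant of the 70B and the repo's bundled test_inference.py helper. Note that -gs allocates VRAM per card in GB, not a ratio; the model directory and the 23,21 split below are illustrative assumptions:

git clone https://github.com/turboderp/exllamav2
cd exllamav2 && pip install .
python test_inference.py \
  -m ~/models/Llama-3.3-70B-Instruct-exl2-4.0bpw \
  -gs 23,21 \
  -p "Write a haiku about asymmetric GPUs."

Second, a minimal sketch for item 06: at Q4_K_M the 32B weights are roughly 19-20 GB, so pinning the process to the 4090 skips the cross-card handoff entirely (the filename and context size are illustrative):

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server \
  --model ~/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  --port 8080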

Step-by-step setup

1. Driver + CUDA

# Assumes NVIDIA's CUDA apt repository is already configured; the exact
# driver/toolkit versions are illustrative, so match what the repo currently ships.
sudo apt install -y nvidia-driver-560 cuda-toolkit-12-6
sudo reboot

# Verify mixed config detected
nvidia-smi --query-gpu=index,name,memory.total --format=csv
# Expected:
#   0, NVIDIA GeForce RTX 4090, 24564 MiB
#   1, NVIDIA GeForce RTX 3090, 24576 MiB
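
# Optional: confirm the PCIe link each card trained at (standard nvidia-smi
# query fields; useful context for failure mode 5 below)
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv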

2. Build llama.cpp with CUDA support

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Binaries land in build/bin/ (llama-server, llama-bench, ...)

# Pull a quantized 70B model
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
  Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --local-dir ~/models

3. Run with asymmetric layer-split

./build/bin/llama-server \
  --model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --tensor-split 1.2,1.0 \
  --ctx-size 8192 \
  --port 8080

The --tensor-split 1.2,1.0 ratio puts ~55% of layers on GPU 0 (4090, faster) and ~45% on GPU 1 (3090, slower). Tune to your specific model — see next section.
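
Once the server is up, a quick smoke test. llama-server exposes an OpenAI-compatible HTTP API, so a plain curl call verifies end-to-end decode (the prompt and token count here are arbitrary):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello."}],"max_tokens":32}'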

Tuning the layer-split ratio

Layer-split assumes equal per-layer compute cost; with asymmetric memory bandwidth this isn't quite true. Measure first, tune second.

# Run a benchmark with different splits and pick the best:
for split in "1.0,1.0" "1.1,1.0" "1.2,1.0" "1.3,1.0"; do
  echo "Testing split=$split"
  ./build/bin/llama-bench --model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 80 --tensor-split $split -p 512 -n 128
done

Typical optimal: 1.15-1.25 for the 4090. In a layer-split the cards run sequentially within each token, so each card's utilization tracks its share of the per-token time; at the optimal ratio the two utilizations roughly equalize. If GPU 0 (the 4090) is pegged while GPU 1 idles, it is carrying too large a share: drop its ratio. If GPU 1 is pegged at 100% while GPU 0 idles, shift layers to the 4090 by increasing its ratio.
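
To watch the imbalance directly while the server is under load (standard nvidia-smi query fields):

watch -n 1 'nvidia-smi --query-gpu=index,utilization.gpu,power.draw,memory.used --format=csv'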

Expected outcome

llama.cpp with --tensor-split tuned for the asymmetric ratio (4090 takes more layers; the slower 3090 holds fewer). Llama 3.3 70B Q4_K_M serves at 12-18 tok/s decode (single stream). The faster 4090 spends ~30% of its time waiting for the 3090 — that's the cost of asymmetry.

Realistic per-stream metrics:

  • Llama 3.3 70B Q4_K_M decode: 12-18 tok/s
  • TTFT (1K prompt): 400-600 ms
  • Power: 700-800W combined under load

Why this is operationally compromised

The fundamental issue: in a layer-split the two cards run strictly in sequence within each token, so the faster card idles whenever the slower one is working. The 4090 brings over twice the raw compute and ~8% more memory bandwidth (1008 vs 936 GB/s), but those advantages buy nothing while it waits on the 3090.

Compare three configs serving the same Llama 3.3 70B Q4 model:

Config                           Tok/s decode   Setup time   Total cost
Mixed 4090 + 3090 (this stack)   12-18          8-12 hours   $2,300-2,800
Dual 3090 NVLink                 25-30          4-6 hours    $1,200-1,800
Dual 4090 PCIe                   28-35          4-6 hours    $4,000-5,000

Failure modes you'll hit

  1. Defaulting to vLLM tensor-parallel. vLLM TP-2 underperforms layer-split by 2-3× on mixed pairs. Force pipeline-parallel mode explicitly, or use llama.cpp.
  2. Layer-split ratio mistuning. Default 1.0,1.0 wastes the 4090's capacity. Measure with llama-bench, tune to 1.15-1.25.
  3. Driver compatibility regressions. NVIDIA driver updates occasionally introduce per-architecture regressions. Test driver upgrades on a non-production rig first.
  4. Power-supply asymmetry. The 4090 spikes to 600 W on transients, the 3090 to 450 W; the PSU must ride out both simultaneously, with momentary combined draw past 1100 W. (A power-cap mitigation sketch follows this list.)
  5. PCIe lane allocation. Both cards typically drop to PCIe 4.0 x8 when two slots are populated on consumer boards. For layer-split inference this is largely harmless: only small activation tensors cross the bus per token, so neither card is meaningfully throttled.
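
A hedged mitigation sketch for failure mode 4: cap sustained board power with nvidia-smi. The caps below are illustrative assumptions; they trade a few percent of decode speed for much smaller transient spikes.

sudo nvidia-smi -i 0 -pl 360   # 4090, below its 450 W default
sudo nvidia-smi -i 1 -pl 300   # 3090, below its 350 W default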

Troubleshooting

Symptom: GPU 0 (4090) at 30% utilization, GPU 1 (3090) at 95%. The 3090 is carrying too many layers for its speed. Increase the 4090's ratio (e.g. from 1.1,1.0 to 1.3,1.0).

Symptom: GPU 1 at 30%, GPU 0 at 95%. The reverse: the 4090 is overloaded. Drop its ratio (e.g. 1.2,1.0 → 1.0,1.0).

Symptom: vLLM tensor-parallel runs but slower than llama.cpp layer-split. Expected. vLLM doesn't handle asymmetric cards well. Switch to llama.cpp.
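
One further check when the split seems to behave backwards: --tensor-split follows CUDA's device enumeration, and CUDA's default ordering (fastest first) can put the 4090 at index 0 regardless of physical slot. Pinning enumeration to bus order and listing bus IDs makes the mapping explicit:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv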

Should you build this from scratch? (No.)

For new builds, pair-match instead. A mixed pair costs more per tok/s than either dual-3090 or dual-4090. The only justifications:

  • You already own one of the cards and the resale loss exceeds the operational cost
  • You're building cheaply from used parts and can only find one card at a time
  • You're explicitly experimenting with asymmetric inference techniques

When this stack is the right answer

  • You own a 3090 and want to add a 4090 without reselling. (12-18 tok/s on 70B is “good enough” for your workload.)
  • You're researching layer-split tuning, asymmetric inference, or transition strategies.
  • You have an unused 3090 from an older build and want to extend its life.

For everyone else: pair-match. Sell the existing card and buy a matching pair. The operational simplicity and per-tok/s economics are dramatically better.

Going deeper