Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path
RTX 4090 + RTX 3090 in the same chassis. PCIe-only (no NVLink). 48 GB total / ~42 GB effective via llama.cpp layer-split. The transitional configuration when you already own one card and want to add the other — operationally honest, not a target build.
What this stack accomplishes (and what it doesn't)
What this stack accomplishes:
- 70B-class models run via llama.cpp layer-split across 48 GB total
- Workable single-user inference for hobby / experimentation
- Bridge between owning one card and committing to a symmetric pair
What it does NOT accomplish:
- Tensor parallelism — vLLM/SGLang assume symmetric cards; mixed pairs underperform by 2-3×
- Maximum throughput — the 4090 wastes ~30% of its compute waiting for the 3090
- Production-grade reliability — too many asymmetric failure surfaces
Hardware required
1× RTX 4090 24GB · 1× RTX 3090 24GB · ATX motherboard with 2× PCIe 4.0 x16 slots · 1300W Platinum PSU · 64-128GB DDR5 · 1TB NVMe · Ubuntu 24.04 LTS · Open-frame chassis OR ATX tower with riser cable for slot 2
Components — what to install and why
- 01 · Hardware · Primary GPU (faster, takes more layers): rtx-4090
RTX 4090 is the throughput leader; in layer-split mode it takes ~55% of the layers. Its FP8 capability is wasted in this config since llama.cpp doesn't extract FP8 the way TensorRT-LLM does.
- 02 · Hardware · Secondary GPU (slower, takes fewer layers): rtx-3090
RTX 3090 holds ~45% of layers in layer-split. Its older Ampere architecture and lower memory bandwidth (936 vs 1008 GB/s) make it the pacing card.
- 03 · Tool · Inference engine (asymmetric layer-split): llama-cpp
llama.cpp is the most practical runtime for asymmetric pairs. Its --tensor-split argument accepts unequal ratios; vLLM and SGLang assume symmetric cards and underperform by 2-3× on mixed setups.
- 04 · Tool · Alternative for asymmetric layer-split: exllamav2
ExLlamaV2 EXL2 quants accept ratio-based splits. Sharper quants than GGUF at equivalent size; pick it when peak per-stream throughput on the 4090 matters most.
- 05 · Model · Primary 70B model (Q4_K_M): llama-3.3-70b-instruct
70B Q4_K_M (~40 GB weights) fits the 48 GB total via layer-split. Real per-stream throughput trails dual-3090 NVLink by ~40-50% due to asymmetric stalling.
- 06 · Model · Coding model (better fit for this combo): qwen-2.5-coder-32b-instruct
32B-class fits with substantial headroom on the asymmetric pair. The smaller model amplifies the mismatch less than 70B (less cross-card work per token).
Step-by-step setup
1. Driver + CUDA
sudo apt install -y nvidia-driver-560 cuda-toolkit-12-6
sudo reboot
# Verify mixed config detected
nvidia-smi --query-gpu=index,name,memory.total --format=csv
# Expected:
# 0, NVIDIA GeForce RTX 4090, 24564 MiB
# 1, NVIDIA GeForce RTX 3090, 24576 MiB
2. Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_CUDA=1 -j$(nproc)
# Pull a quantized 70B model
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--local-dir ~/models
3. Run with asymmetric layer-split
./llama-server \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 80 \
--tensor-split 1.2,1.0 \
--ctx-size 8192 \
--port 8080
The --tensor-split 1.2,1.0 ratio puts ~55% of layers on GPU 0 (4090, faster) and ~45% on GPU 1 (3090, slower). Tune to your specific model — see next section.
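Before tuning, it helps to see what a ratio actually means in layers and VRAM. A back-of-the-envelope sketch (assumptions: 80 transformer layers and ~40 GB of Q4_K_M weights for Llama 3.3 70B; llama.cpp's actual placement may differ by a layer or two):

```shell
# Convert a --tensor-split ratio into per-GPU layer counts and the
# approximate share of the weights each card holds.
ratio_fast=1.2   # GPU 0 (RTX 4090)
ratio_slow=1.0   # GPU 1 (RTX 3090)
layers=80
weights_gb=40

awk -v rf="$ratio_fast" -v rs="$ratio_slow" -v n="$layers" -v gb="$weights_gb" 'BEGIN {
  total = rf + rs
  l0 = int(n * rf / total + 0.5)   # round to nearest layer
  l1 = n - l0
  printf "GPU0: %d layers (~%.1f GB)\n", l0, gb * l0 / n
  printf "GPU1: %d layers (~%.1f GB)\n", l1, gb * l1 / n
}'
```

At 1.2,1.0 this lands on roughly 44 layers (~22 GB) for the 4090 and 36 (~18 GB) for the 3090, which is the ~55/45 split described above.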
Tuning the layer-split ratio
Layer-split assumes equal per-layer compute cost; with asymmetric memory bandwidth this isn't quite true. Measure first, tune second.
# Run a benchmark with different splits and pick the best:
for split in "1.0,1.0" "1.1,1.0" "1.2,1.0" "1.3,1.0"; do
echo "Testing split=$split"
./llama-bench --model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 80 --tensor-split $split -p 512 -n 128
done
Typical optimal: 1.15-1.25 for the 4090. If GPU 1 (3090) is pegged at 100% while GPU 0 idles, the 3090 is holding too many layers for its speed — increase the 4090's ratio. If GPU 0 utilization is much higher than GPU 1's, drop the ratio and shift layers back to the 3090.
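The decision rule above can be wrapped in a tiny helper. A sketch — the 20-point threshold is illustrative, and `suggest_split` is a hypothetical name; feed it the per-GPU utilization numbers you read off nvidia-smi:

```shell
# Turn two utilization readings into a tuning hint.
# Usage: suggest_split <gpu0_util> <gpu1_util>
suggest_split() {
  local u0=$1 u1=$2
  local diff=$(( u0 - u1 ))
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  if [ "$diff" -gt 20 ]; then
    echo "imbalanced: shift layers away from the busier card"
  else
    echo "balanced: keep the current split"
  fi
}

suggest_split 95 40   # one card pegged, the other idling
```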
Expected outcome
llama.cpp with --tensor-split tuned for the asymmetric ratio (4090 takes more layers; the slower 3090 holds fewer). Llama 3.3 70B Q4_K_M serves at 12-18 tok/s decode (single stream). The faster 4090 spends ~30% of its time waiting for the 3090 — that's the cost of asymmetry.
Realistic per-stream metrics:
- Llama 3.3 70B Q4_K_M decode: 12-18 tok/s
- TTFT (1K prompt): 400-600 ms
- Power: 700-800W combined under load
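Taken together, those numbers imply an energy cost per generated token. Rough arithmetic on the midpoints (750 W sustained, 15 tok/s):

```shell
# Energy per token = sustained watts / decode rate (joules per token).
awk 'BEGIN {
  watts = 750; tok_s = 15
  j_per_tok = watts / tok_s
  printf "%.0f J/token (~%.1f Wh per 1000 tokens)\n", j_per_tok, j_per_tok * 1000 / 3600
}'
```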
Why this is operationally compromised
The fundamental issue: every layer transition between the two cards requires waiting for the slower card. The 4090 has well over twice the raw compute and ~7% more memory bandwidth, but those advantages are largely wasted in a layer-split, where the cards process sequentially per token.
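A toy model makes the sequential cost concrete (assumptions: decode is memory-bandwidth-bound, ~40 GB of Q4_K_M weights, the 55/45 split; attention/KV work and PCIe hops are ignored):

```shell
# Per-token decode time is roughly the time for each card to stream its
# share of the weights; the cards run one after the other, never concurrently.
awk 'BEGIN {
  gb = 40                   # Q4_K_M 70B weights, GB
  bw0 = 1008; bw1 = 936     # GB/s: RTX 4090, RTX 3090
  t0 = gb * 0.55 / bw0      # seconds GPU 0 is busy per token
  t1 = gb * 0.45 / bw1      # seconds GPU 1 is busy per token
  printf "decode upper bound: %.1f tok/s\n", 1 / (t0 + t1)
}'
```

The bandwidth-only ceiling comes out around 24 tok/s; the measured 12-18 tok/s reflects kernel-launch overhead, the PCIe hop between cards, and the work this sketch ignores.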
Compare three configs serving the same Llama 3.3 70B Q4 model:
| Config | Tok/s decode | Setup time | Total cost |
|---|---|---|---|
| Mixed 4090 + 3090 (this stack) | 12-18 | 8-12 hours | $2,300-2,800 |
| Dual 3090 NVLink | 25-30 | 4-6 hours | $1,200-1,800 |
| Dual 4090 PCIe | 28-35 | 4-6 hours | $4,000-5,000 |
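Dividing midpoints of the cost and decode ranges in the table gives a rough dollars-per-tok/s comparison (illustrative arithmetic only):

```shell
# Cost per sustained tok/s, using the midpoint of each row's ranges.
awk 'BEGIN {
  printf "Mixed 4090+3090: $%.0f per tok/s\n", 2550 / 15
  printf "Dual 3090:       $%.0f per tok/s\n", 1500 / 27.5
  printf "Dual 4090:       $%.0f per tok/s\n", 4500 / 31.5
}'
```

The mixed pair is the worst value of the three — the quantitative version of "transitional configuration, not a target build."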
Failure modes you'll hit
- Defaulting to vLLM tensor-parallel. vLLM TP-2 underperforms layer-split by 2-3× on mixed pairs. Force pipeline-parallel mode explicitly, or use llama.cpp.
- Layer-split ratio mistuning. Default 1.0,1.0 wastes the 4090's capacity. Measure with llama-bench, tune to 1.15-1.25.
- Driver compatibility regressions. NVIDIA driver updates occasionally introduce per-architecture regressions. Test driver upgrades on a non-production rig first.
- Power-supply asymmetry. The 4090 spikes to 600W transient; the 3090 to 450W. PSU sizing must accommodate both simultaneously, hitting 1100W+ momentarily.
- PCIe lane allocation. Both cards typically run at PCIe 4.0 x8 on consumer boards. Neither card notices: layer-split sends only a small activation tensor across the bus per token, so x8 is a non-issue here.
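The PSU sizing point is worth checking with the transient figures above. A quick headroom sketch (the ~150 W platform draw for CPU, RAM, and storage is an assumption):

```shell
# Worst-case transient power vs PSU capacity.
awk 'BEGIN {
  gpu0 = 600; gpu1 = 450   # W, spec transient spikes (4090, 3090)
  platform = 150           # W, assumed CPU + RAM + NVMe
  psu = 1300
  total = gpu0 + gpu1 + platform
  printf "worst-case transient: %d W (%.0f%% of the %d W PSU)\n", total, 100 * total / psu, psu
}'
```

At ~1200 W momentary draw, the 1300 W Platinum unit in the parts list has margin, but a 1000 W PSU would not.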
Troubleshooting
Symptom: GPU 0 (4090) at 30% utilization, GPU 1 (3090) at 95%. The 3090 is holding too many layers for its speed. Increase the 4090's ratio (e.g. from 1.0,1.0 to 1.2,1.0).
Symptom: GPU 1 at 30%, GPU 0 at 95%. Reverse — the 4090 is overloaded; drop its ratio (e.g. 1.3,1.0 → 1.15,1.0).
Symptom: vLLM tensor-parallel runs but slower than llama.cpp layer-split. Expected. vLLM doesn't handle asymmetric cards well. Switch to llama.cpp.
Should you build this from scratch? (No.)
For new builds, pair-match instead. Mixed-GPU costs more per-tok/s than either dual-3090 or dual-4090. The only justifications:
- You already own one of the cards and the resale loss exceeds the operational cost
- You're building cheaply from used parts and can only find one card at a time
- You're explicitly experimenting with asymmetric inference techniques
When this stack is the right answer
- You own a 3090 and want to add a 4090 without reselling. (12-18 tok/s on 70B is “good enough” for your workload.)
- You're researching layer-split tuning, asymmetric inference, or transition strategies.
- You have an unused 3090 from an older build and want to extend its life.
For everyone else: pair-match. Sell the existing card and buy a matching pair. The operational simplicity and per-tok/s economics are dramatically better.
Going deeper
- Mixed 4090 + 3090 combo detail — the operator-grade tradeoff analysis.
- Will-it-run for this combo — fit verdict for every catalog model.
- Multi-GPU buying guide — full decision framework.
- Dual RTX 3090 stack — the cheaper symmetric alternative.
- Dual RTX 4090 stack — the newer-architecture symmetric alternative.