RTX 4090 + RTX 3090 (asymmetric 24+24 GB)
Asymmetric multi-GPU: a 4090 paired with a 3090, connected over PCIe 4.0 only (the 4090 has no NVLink), with different SM counts and different memory bandwidth. Throughput is bottlenecked by the slower card on most split strategies.
Mixed-GPU configurations are operationally honest about VRAM but compromised on throughput. The 4090 has 1008 GB/s of memory bandwidth vs 936 GB/s on the 3090, close enough for tensor parallelism to function, but the 4090's faster compute sits idle waiting for the 3090 on every layer. Effective VRAM is roughly the total minus ~3 GB per card (more overhead than symmetric pairs because the runtime loads slightly different weight shards). Tensor parallelism technically works; pipeline parallelism (layer split) works better in practice: each card owns a different range of layers, so there is no per-layer synchronization stall on the faster card.
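A quick way to express that effective-VRAM rule of thumb; the ~3 GB per-card overhead is this page's estimate, not a measured constant.

```python
# Rough effective-VRAM estimate for this pair. The ~3 GB per-card overhead
# is this page's rule of thumb, not a measured constant.
cards_gb = {"rtx4090": 24, "rtx3090": 24}
overhead_per_card_gb = 3   # CUDA context, runtime buffers, uneven weight shards

effective_vram_gb = sum(v - overhead_per_card_gb for v in cards_gb.values())
print(f"~{effective_vram_gb} GB usable for weights + KV cache")   # ~42 GB
```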
Topology
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.
Why this combo (or why not)
Mixed-GPU builds are the transitional reality of upgrading a single-card rig. You bought a 3090 in 2023, you want to upgrade in 2026, you're not selling the 3090. So you add a 4090.
The honest framing:
- This works. llama.cpp layer-split or ExLlamaV2 can use both cards.
- It's slower per dollar than symmetric pairs.
- It's harder to debug.
- It's a transitional configuration; plan to homogenize within 12 months.
Runtime compatibility
- llama.cpp ✓ best fit. Layer-split via --tensor-split accepts asymmetric ratios.
- ExLlamaV2 ✓ good. EXL2 split by ratio works.
- Ollama ✓ functional. GGUF layer-split works under the hood.
- vLLM ✗ poor. Tensor-parallel assumes symmetric cards; performance suffers.
- SGLang ✗ poor. Same issue as vLLM.
Split strategy
Layer-split (pipeline parallelism) is the only reasonable path for this combo. Manually tune the split ratio so the slower card holds proportionally fewer layers.
For a 70B Q4 model with 80 layers: try splitting 36 layers on 3090 / 44 on 4090 (4090 takes more because it's faster). Benchmark and adjust.
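One way to pick a starting --tensor-split ratio before benchmarking is to weight each card by memory bandwidth, since decode is mostly bandwidth-bound. This is a sketch of one heuristic, not the page's prescribed method; the 36/44 suggestion above shifts a few extra layers onto the 4090 to account for its faster compute.

```python
# Heuristic starting point for --tensor-split on a mixed pair: weight each
# card by memory bandwidth, round to whole layers, then benchmark and nudge
# more layers toward the 4090 since it also has faster compute.
total_layers = 80                                    # 70B-class dense model
bandwidth_gb_s = {"rtx3090": 936, "rtx4090": 1008}

total_bw = sum(bandwidth_gb_s.values())
split = {card: round(total_layers * bw / total_bw) for card, bw in bandwidth_gb_s.items()}
print(split)   # {'rtx3090': 39, 'rtx4090': 41} -- starting ratio before tuning
```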
Comparison vs symmetric pairs
| Metric | Mixed 4090+3090 | Dual 3090 (NVLink) | Dual 4090 (PCIe) |
|---|---|---|---|
| Decode (70B Q4) | 12-18 tok/s | 25-30 tok/s | 28-35 tok/s |
| Setup time | 8-12 hours | 4-6 hours | 4-6 hours |
| Cost (used 3090 / new 4090) | $2,300-2,800 | $1,200-1,800 | $4,000-5,000 |
Mixed is the worst per-dollar for new-build users. It's only the right answer if you ALREADY own one of the cards.
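To put a number on "worst per-dollar", here is the cost-per-decode-tok/s arithmetic using the midpoints of the table above; the table figures are this page's estimates, so treat the ratios rather than the absolute dollars as the takeaway.

```python
# Cost per decode tok/s from the midpoints of the comparison table above.
builds = {
    "mixed 4090+3090": {"cost_usd": 2550, "tok_s": 15.0},   # $2,300-2,800, 12-18 tok/s
    "dual 3090":       {"cost_usd": 1500, "tok_s": 27.5},   # $1,200-1,800, 25-30 tok/s
    "dual 4090":       {"cost_usd": 4500, "tok_s": 31.5},   # $4,000-5,000, 28-35 tok/s
}
for name, b in builds.items():
    print(f"{name}: ${b['cost_usd'] / b['tok_s']:,.0f} per tok/s")
# mixed ~$170, dual 3090 ~$55, dual 4090 ~$143 per decode tok/s
```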
Related
- /hardware-combos/dual-rtx-3090 — symmetric alternative
- /hardware-combos/dual-rtx-4090 — symmetric upgrade target
- /guides/running-local-ai-on-multiple-gpus-2026 — multi-GPU buying guide
- /tools/llama-cpp — the runtime that handles asymmetric splits best
Best model classes
- 70B-class dense at Q4 with layer-split. llama.cpp's layer-split puts a different range of layers on each card; the 4090 finishes its layers quickly and then waits on the 3090. Because each token traverses both cards in sequence, their per-layer latencies add, so net decode throughput lands roughly where a single card would if it could hold the whole model (see the sketch below).
- 32-50B class where you don't need extreme throughput.
For tensor parallelism this combo is suboptimal — the 4090's compute is wasted waiting for the 3090. For layer-split it's adequate.
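A back-of-envelope version of that layer-split decode estimate; the hypothetical solo tok/s figures are illustrative assumptions, not benchmarks.

```python
# Single-stream decode estimate for layer-split (pipeline) inference: each
# token passes through both cards in sequence, so per-stage latencies add.
# Solo tok/s figures below are illustrative assumptions, not measurements.
layer_fraction = {"rtx3090": 36 / 80, "rtx4090": 44 / 80}
solo_tok_s = {"rtx3090": 15.0, "rtx4090": 22.0}   # hypothetical whole-model-on-one-card speeds

latency_s_per_token = sum(frac / solo_tok_s[card] for card, frac in layer_fraction.items())
print(f"Estimated decode: {1 / latency_s_per_token:.0f} tok/s")   # ~18 tok/s, consistent with the table
```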
What this combo is bad at
- Tensor parallelism — synchronization stalls on every all-reduce.
- Maximum-throughput serving — symmetric dual-4090 or dual-3090 outperforms.
- Production environments — operational complexity is high; debugging is harder.
This combo exists because users have a 3090 and want to add a 4090 (or vice-versa) without selling the older card. It's a transitional configuration, not a target build.
Who should avoid this
- Anyone starting from scratch — symmetric dual is always operationally simpler.
- Production users — too many failure surfaces.
- Tensor-parallelism workloads — the asymmetry kills the value.
If you're considering this, check the resale value of one of the cards and whether selling + buying a matching pair is cheaper than the operational pain over 12 months.
Mixed cards in the same chassis create asymmetric thermal load. The 4090 produces more heat; place it in the upper slot for better airflow. Monitor each card's memory-junction temperature independently.
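A minimal per-card polling sketch using nvidia-smi's query interface; temperature.memory is frequently reported as N/A on GeForce cards, where memory-junction readings need vendor tools, so treat that column as best-effort.

```python
# Poll each GPU independently; on GeForce cards temperature.memory often
# reads "N/A" (memory-junction temperature needs vendor tooling there),
# so watch core temperature and power draw at minimum.
import subprocess

fields = "index,name,temperature.gpu,temperature.memory,power.draw"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    print(line)   # e.g. "0, NVIDIA GeForce RTX 4090, 63, N/A, 402.51 W"
```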
Mixed-vendor / mixed-architecture rigs are inherently less reliable than symmetric pairs because each card's thermal and power behavior is independent. Driver updates can introduce per-card regressions.
Ubuntu 24.04 LTS
Failure modes specific to mixed 4090 + 3090
- Tensor parallelism throughput collapse. vLLM defaults to tensor-parallel; on mixed-GPU it underperforms layer-split by 2-3×. Force pipeline-parallel mode explicitly.
- Driver compatibility regressions. NVIDIA driver updates occasionally introduce per-architecture regressions. Test driver upgrades on a non-production rig first.
- Memory-bandwidth mismatch hitting layers unevenly. llama.cpp's layer-split assumes equal per-layer cost; with asymmetric memory bandwidth this isn't quite true. Manual --tensor-split tuning is required for optimal throughput.
- Power-supply asymmetry. The 4090 spikes to 600W transient; the 3090 to 450W. PSU sizing must accommodate both simultaneously, which can hit 1100W+ momentarily (see the sketch after this list).
- PCIe lane allocation. With two slots populated, both cards typically run at PCIe 4.0 x8. The 4090 doesn't notice; the 3090 doesn't either; this is one of the few areas where mixed builds aren't worse.
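A back-of-envelope check of the power-supply point above; the rest-of-system figure and the 20% headroom factor are assumptions for illustration.

```python
# Worst-case transient sum for PSU sizing on this mixed pair. GPU transient
# figures are the ones quoted above; rest-of-system draw and the 20% headroom
# factor are illustrative assumptions.
gpu_transient_w = {"rtx4090": 600, "rtx3090": 450}
rest_of_system_w = 150          # assumed CPU + platform draw under load

peak_w = sum(gpu_transient_w.values()) + rest_of_system_w
print(f"Transient peak: ~{peak_w} W")                             # ~1200 W
print(f"PSU rating with ~20% headroom: ~{peak_w * 1.2:.0f} W")    # ~1440 W
```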
Dual RTX 3090 →
A symmetric pair beats the mixed-GPU build on cost per decode tok/s by 50-80%. Mixed only makes sense if you already own one of the cards.
Mixed RTX 4090 + 3090 workstation →
Asymmetric layer-split via llama.cpp — the operationally honest path when pair-matching isn't an option.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090)
mixtral-8x22b-instruct (MoE) on a mixed-GPU rig via llama.cpp layer-split. Mixtral 8x22B (39B active, 141B total) at Q4_K_M is ~80 GB and does NOT fit on dual 24 GB cards even with layer-split; Q3_K_M is still roughly 68 GB, so only ~2-bit quants get close to fitting. Marked pending to verify quant fit before measurement.
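A quant-fit estimate under loose assumptions: the bits-per-weight figures are approximate GGUF averages, the per-card overhead reuses the ~3 GB rule of thumb from above, and KV cache is ignored, so this is optimistic.

```python
# Does Mixtral 8x22B (~141B total params) fit on 2 x 24 GB with layer-split?
# bpw values are approximate GGUF averages; KV cache is ignored (optimistic).
total_params_b = 141
usable_vram_gb = (24 - 3) + (24 - 3)        # ~42 GB, reusing the ~3 GB/card overhead rule

approx_bpw = {"Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.0, "IQ2_XS": 2.3}
for quant, bpw in approx_bpw.items():
    size_gb = total_params_b * bpw / 8
    verdict = "fits" if size_gb <= usable_vram_gb else "does not fit"
    print(f"{quant}: ~{size_gb:.0f} GB -> {verdict} in ~{usable_vram_gb} GB")
```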
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.