RTX 4090 + RTX 3090 (asymmetric 24+24 GB)
Asymmetric multi-GPU: a 4090 paired with a 3090, connected over PCIe 4.0 only (the 4090 has no NVLink), with different SM counts and different memory bandwidth. Throughput is bottlenecked by the slower card on most split strategies.
Mixed-GPU configurations are operationally honest about VRAM but compromised on throughput. The 4090 has 1008 GB/s of memory bandwidth vs 936 GB/s on the 3090, close enough for tensor parallelism to function, but the 4090's faster compute sits idle waiting for the 3090 on every layer. Effective VRAM is roughly the total minus ~3 GB per card (more overhead than symmetric pairs because the runtime loads slightly different weight shards). Tensor parallelism technically works; pipeline parallelism (layer split) works better in practice: each card owns a different range of layers, so there is no per-layer synchronization stall on the faster card.
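A quick way to express that effective-VRAM rule of thumb; the ~3 GB per-card overhead is this page's estimate, not a measured constant.

```python
# Rough effective-VRAM estimate for this pair. The ~3 GB per-card overhead
# is this page's rule of thumb, not a measured constant.
cards_gb = {"rtx4090": 24, "rtx3090": 24}
overhead_per_card_gb = 3   # CUDA context, runtime buffers, uneven weight shards

effective_vram_gb = sum(v - overhead_per_card_gb for v in cards_gb.values())
print(f"~{effective_vram_gb} GB usable for weights + KV cache")   # ~42 GB
```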
Topology
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.
Why this combo (or why not)
Mixed-GPU builds are the transitional reality of upgrading a single-card rig. You bought a 3090 in 2023, you want to upgrade in 2026, you're not selling the 3090. So you add a 4090.
The honest framing:
- This works. llama.cpp layer-split or ExLlamaV2 can use both cards.
- It's slower per dollar than symmetric pairs.
- It's harder to debug.
- It's a transitional configuration; plan to homogenize within 12 months.
Runtime compatibility
- llama.cpp ✓ best fit. Layer-split via --tensor-split accepts asymmetric ratios.
- ExLlamaV2 ✓ good. EXL2 split by ratio works.
- Ollama ✓ functional. GGUF layer-split works under the hood.
- vLLM ✗ poor. Tensor-parallel assumes symmetric cards; performance suffers.
- SGLang ✗ poor. Same issue as vLLM.
Split strategy
Layer-split (pipeline parallelism) is the only reasonable path for this combo. Manually tune the split ratio so the slower card holds proportionally fewer layers.
For a 70B Q4 model with 80 layers: try splitting 36 layers on 3090 / 44 on 4090 (4090 takes more because it's faster). Benchmark and adjust.
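One way to pick a starting --tensor-split ratio before benchmarking is to weight each card by memory bandwidth, since decode is mostly bandwidth-bound. This is a sketch of one heuristic, not the page's prescribed method; the 36/44 suggestion above shifts a few extra layers onto the 4090 to account for its faster compute.

```python
# Heuristic starting point for --tensor-split on a mixed pair: weight each
# card by memory bandwidth, round to whole layers, then benchmark and nudge
# more layers toward the 4090 since it also has faster compute.
total_layers = 80                                    # 70B-class dense model
bandwidth_gb_s = {"rtx3090": 936, "rtx4090": 1008}

total_bw = sum(bandwidth_gb_s.values())
split = {card: round(total_layers * bw / total_bw) for card, bw in bandwidth_gb_s.items()}
print(split)   # {'rtx3090': 39, 'rtx4090': 41} -- starting ratio before tuning
```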
Comparison vs symmetric pairs
| Metric | Mixed 4090+3090 | Dual 3090 (NVLink) | Dual 4090 (PCIe) |
|---|---|---|---|
| Decode (70B Q4) | 12-18 tok/s | 25-30 tok/s | 28-35 tok/s |
| Setup time | 8-12 hours | 4-6 hours | 4-6 hours |
| Cost (used 3090 / new 4090) | $2,300-2,800 | $1,200-1,800 | $4,000-5,000 |
Mixed is the worst per-dollar for new-build users. It's only the right answer if you ALREADY own one of the cards.
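To put a number on "worst per-dollar", here is the cost-per-decode-tok/s arithmetic using the midpoints of the table above; the table figures are this page's estimates, so treat the ratios rather than the absolute dollars as the takeaway.

```python
# Cost per decode tok/s from the midpoints of the comparison table above.
builds = {
    "mixed 4090+3090": {"cost_usd": 2550, "tok_s": 15.0},   # $2,300-2,800, 12-18 tok/s
    "dual 3090":       {"cost_usd": 1500, "tok_s": 27.5},   # $1,200-1,800, 25-30 tok/s
    "dual 4090":       {"cost_usd": 4500, "tok_s": 31.5},   # $4,000-5,000, 28-35 tok/s
}
for name, b in builds.items():
    print(f"{name}: ${b['cost_usd'] / b['tok_s']:,.0f} per tok/s")
# mixed ~$170, dual 3090 ~$55, dual 4090 ~$143 per decode tok/s
```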
Related
- /hardware-combos/dual-rtx-3090 — symmetric alternative
- /hardware-combos/dual-rtx-4090 — symmetric upgrade target
- /guides/running-local-ai-on-multiple-gpus-2026 — multi-GPU buying guide
- /tools/llama-cpp — the runtime that handles asymmetric splits best
Best model classes
- 70B-class dense at Q4 with layer-split. llama.cpp's layer-split puts a different range of layers on each card; the 4090 finishes its layers quickly and then waits on the 3090. Because each token traverses both cards in sequence, their per-layer latencies add, so net decode throughput lands roughly where a single card would if it could hold the whole model (see the sketch below).
- 32-50B class where you don't need extreme throughput.
For tensor parallelism this combo is suboptimal — the 4090's compute is wasted waiting for the 3090. For layer-split it's adequate.
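A back-of-envelope version of that layer-split decode estimate; the hypothetical solo tok/s figures are illustrative assumptions, not benchmarks.

```python
# Single-stream decode estimate for layer-split (pipeline) inference: each
# token passes through both cards in sequence, so per-stage latencies add.
# Solo tok/s figures below are illustrative assumptions, not measurements.
layer_fraction = {"rtx3090": 36 / 80, "rtx4090": 44 / 80}
solo_tok_s = {"rtx3090": 15.0, "rtx4090": 22.0}   # hypothetical whole-model-on-one-card speeds

latency_s_per_token = sum(frac / solo_tok_s[card] for card, frac in layer_fraction.items())
print(f"Estimated decode: {1 / latency_s_per_token:.0f} tok/s")   # ~18 tok/s, consistent with the table
```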
What this combo is bad at
- Tensor parallelism — synchronization stalls on every all-reduce.
- Maximum-throughput serving — symmetric dual-4090 or dual-3090 outperforms.
- Production environments — operational complexity is high; debugging is harder.
This combo exists because users have a 3090 and want to add a 4090 (or vice-versa) without selling the older card. It's a transitional configuration, not a target build.
Who should avoid this
- Anyone starting from scratch — symmetric dual is always operationally simpler.
- Production users — too many failure surfaces.
- Tensor-parallelism workloads — the asymmetry kills the value.
If you're considering this, check the resale value of one of the cards and whether selling + buying a matching pair is cheaper than the operational pain over 12 months.
Mixed cards in the same chassis create asymmetric thermal load. The 4090 produces more heat; place it in the upper slot for better airflow. Monitor each card's memory-junction temperature independently.
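A minimal per-card polling sketch using nvidia-smi's query interface; temperature.memory is frequently reported as N/A on GeForce cards, where memory-junction readings need vendor tools, so treat that column as best-effort.

```python
# Poll each GPU independently; on GeForce cards temperature.memory often
# reads "N/A" (memory-junction temperature needs vendor tooling there),
# so watch core temperature and power draw at minimum.
import subprocess

fields = "index,name,temperature.gpu,temperature.memory,power.draw"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    print(line)   # e.g. "0, NVIDIA GeForce RTX 4090, 63, N/A, 402.51 W"
```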
Mixed-vendor / mixed-architecture rigs are inherently less reliable than symmetric pairs because each card's thermal and power behavior is independent. Driver updates can introduce per-card regressions.
Ubuntu 24.04 LTS
Failure modes specific to mixed 4090 + 3090
- Tensor parallelism throughput collapse. vLLM defaults to tensor-parallel; on mixed-GPU it underperforms layer-split by 2-3×. Force pipeline-parallel mode explicitly.
- Driver compatibility regressions. NVIDIA driver updates occasionally introduce per-architecture regressions. Test driver upgrades on a non-production rig first.
- Memory-bandwidth mismatch hitting layers unevenly. llama.cpp's layer-split assumes equal per-layer cost; with asymmetric memory bandwidth this isn't quite true. Manual --tensor-split tuning is required for optimal throughput.
- Power-supply asymmetry. The 4090 spikes to 600W transient; the 3090 to 450W. PSU sizing must accommodate both simultaneously, which can hit 1100W+ momentarily (see the sketch after this list).
- PCIe lane allocation. With two slots populated, both cards typically run at PCIe 4.0 x8. The 4090 doesn't notice; the 3090 doesn't either; this is one of the few areas where mixed builds aren't worse.
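A back-of-envelope check of the power-supply point above; the rest-of-system figure and the 20% headroom factor are assumptions for illustration.

```python
# Worst-case transient sum for PSU sizing on this mixed pair. GPU transient
# figures are the ones quoted above; rest-of-system draw and the 20% headroom
# factor are illustrative assumptions.
gpu_transient_w = {"rtx4090": 600, "rtx3090": 450}
rest_of_system_w = 150          # assumed CPU + platform draw under load

peak_w = sum(gpu_transient_w.values()) + rest_of_system_w
print(f"Transient peak: ~{peak_w} W")                             # ~1200 W
print(f"PSU rating with ~20% headroom: ~{peak_w * 1.2:.0f} W")    # ~1440 W
```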
Dual RTX 3090 →
A symmetric pair beats the mixed-GPU build on cost per decode tok/s by 50-80%. Mixed only makes sense if you already own one of the cards.
Mixed RTX 4090 + 3090 workstation →
Asymmetric layer-split via llama.cpp — the operationally honest path when pair-matching isn't an option.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090)
mixtral-8x22b-instruct (MoE) on a mixed-GPU rig via llama.cpp layer-split. Mixtral 8x22B (39B active, 141B total) at Q4_K_M is ~80 GB and does NOT fit on dual 24 GB cards even with layer-split; Q3_K_M is still roughly 68 GB, so only ~2-bit quants get close to fitting. Marked pending to verify quant fit before measurement.
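A quant-fit estimate under loose assumptions: the bits-per-weight figures are approximate GGUF averages, the per-card overhead reuses the ~3 GB rule of thumb from above, and KV cache is ignored, so this is optimistic.

```python
# Does Mixtral 8x22B (~141B total params) fit on 2 x 24 GB with layer-split?
# bpw values are approximate GGUF averages; KV cache is ignored (optimistic).
total_params_b = 141
usable_vram_gb = (24 - 3) + (24 - 3)        # ~42 GB, reusing the ~3 GB/card overhead rule

approx_bpw = {"Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.0, "IQ2_XS": 2.3}
for quant, bpw in approx_bpw.items():
    size_gb = total_params_b * bpw / 8
    verdict = "fits" if size_gb <= usable_vram_gb else "does not fit"
    print(f"{quant}: ~{size_gb:.0f} GB -> {verdict} in ~{usable_vram_gb} GB")
```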
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.