Mixed GPU · PCIe · advanced

What runs on RTX 4090 + RTX 3090 (asymmetric 24+24 GB)?

Asymmetric multi-GPU: a 4090 paired with a 3090, linked over PCIe 4.0 only (the 4090 has no NVLink). Different SM counts, different memory bandwidth. Throughput is bottlenecked by the slower card on most split strategies.

At a glance
  • Effective VRAM: 42 / 48 GB (not pooled)
  • Speed penalty: ~35% vs ideal single-card
  • Recommended runtime: llama-cpp (pipeline parallel)
  • Setup difficulty: advanced · ~800W peak
  • Models: 24 fit · 12 borderline · 8 not practical
  • Deployment recipe: Mixed RTX 4090 + 3090 workstation →

Asymmetric layer-split via llama.cpp — the operationally honest path when pair-matching isn't an option.

Memory budget
  • Total VRAM: 48 GB
  • Effective for inference: 42 GB (88% of total)
  • Not pooled

Mixed-GPU configurations are operationally honest about VRAM but compromised on throughput. The 4090 has 1008 GB/s of memory bandwidth vs 936 GB/s on the 3090: close enough for tensor parallelism to work, but the 4090's faster compute stalls waiting for the 3090 at every layer. Effective VRAM is roughly total minus ~3 GB per card (more overhead than symmetric pairs, because the runtime loads slightly different weight shards on each card). Tensor parallelism technically works; pipeline parallelism (layer split) works better in practice: different cards own different layers, so the faster card never waits mid-layer.
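A minimal invocation for this layout, sketched against current llama.cpp server flags (model filename and context size are placeholders):

  # Pipeline-style layer split: each card owns whole layers, so the 4090
  # never stalls mid-layer waiting on the 3090.
  #   -ngl 99      offload all layers to the GPUs
  #   -sm layer    split by whole layers instead of by rows
  #   -ts 24,24    split ratio; shifting it toward the 4090 (e.g. -ts 28,20)
  #                hands the faster card more layers at the cost of its
  #                VRAM headroom
  #   -mg 0        main GPU for small tensors and intermediate results
  ./llama-server -m qwen2.5-32b-instruct-q4_k_m.gguf \
      -ngl 99 -sm layer -ts 24,24 -mg 0 -c 8192

exllamav2 exposes the same idea through its gpu_split setting; ollama picks a split on its own.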

Why total VRAM is not the whole story

Asymmetric cards. Layer-split via llama.cpp distributes layers by a configurable ratio. 48 GB total, but about 42 GB usable once per-card runtime overhead is paid; the slower card then sets the pace, not the capacity.

See the multi-GPU guide for the full math + tradeoffs.
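Once a model is loaded, confirm where the memory actually went rather than trusting the ratio; a stock nvidia-smi query is enough:

  # Per-card memory check after load. Each card should sit a few GB under
  # its 24 GB; if one is full while the other has headroom, retune -ts.
  nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv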

Topology
  • Topology: mixed-gpu
  • Interconnect: PCIe 4.0 (~32 GB/s)
  • Component count: 2 units
  • Components: 1× rtx-4090, 1× rtx-3090
  • Recommended runtime: llama-cpp (also: exllamav2, ollama)
  • Recommended split strategy: pipeline-parallel (also: layer-split)
  • Setup difficulty: advanced · ~800W peak
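The ~32 GB/s figure assumes both cards negotiated PCIe 4.0 x16; consumer boards routinely drop the second slot to x8 or x4 through the chipset. Two stock commands confirm what you actually got:

  # Link generation and width per card. A 3090 in a chipset-fed gen4 x4
  # slot moves ~8 GB/s, a quarter of the assumed budget.
  nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv

  # How the two GPUs reach each other (PHB/NODE/SYS = hops through the
  # CPU or chipset rather than a shared PCIe switch).
  nvidia-smi topo -m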

Models that fit comfortably (24)

Weights land at ≤ 85% of effective VRAM at the smallest production quant, leaving comfortable headroom for KV cache. All entries carry the combo-wide ~35% speed penalty vs an ideal single card; percentages are of the 42 GB effective VRAM.

  • Nemotron 3 Super 49B: 49B · AWQ-INT4 → 32 GB (76%)
  • Mixtral 8x7B Instruct: 47B · Q4_K_M → 32 GB (76%)
  • Command R 35B: 35B · Q4_K_M → 26 GB (62%)
  • Aya 23 35B: 35B · Q4_K_M → 24 GB (57%)
  • Yi 1.5 34B: 34B · Q4_K_M → 24 GB (57%)
  • Phind CodeLlama 34B v2: 34B · Q4_K_M → 24 GB (57%)
  • DeepSeek Coder V3: 33B · AWQ-INT4 → 22 GB (52%)
  • Qwen 3 32B: 32B · Q4_K_M → 24 GB (57%)
  • DeepSeek R1 Distill Qwen 3 32B: 32B · AWQ-INT4 → 22 GB (52%)
  • OLMo 2 32B: 32B · Q4_K_M → 24 GB (57%)
  • Qwen 3 Coder 32B: 32B · AWQ-INT4 → 22 GB (52%)
  • QwQ 32B Preview: 32B · Q4_K_M → 24 GB (57%)
  • Aya Expanse 32B: 32B · AWQ-INT4 → 22 GB (52%)
  • Qwen 2.5 32B Instruct: 32B · Q4_K_M → 24 GB (57%)
  • EXAONE 3.5 32B: 32B · AWQ-INT4 → 22 GB (52%)
  • DeepSeek R1 Distill Qwen 32B: 32B · Q4_K_M → 24 GB (57%)
  • Qwen 2.5 Coder 32B Instruct: 32B · Q4_K_M → 24 GB (57%)
  • Magistral 32B: 32B · AWQ-INT4 → 22 GB (52%)
  • Gemma 4 31B Dense: 31B · Q4_K_M → 24 GB (57%)
  • Qwen 3 30B-A3B: 30B · Q4_K_M → 22 GB (52%)
  • Nemotron 3 Nano (30B-A3B): 30B · Q4_K_M → 22 GB (52%)
  • Gemma 3 27B: 27B · Q4_K_M → 20 GB (48%)
  • MedGemma 27B: 27B · Q4_K_M → 20 GB (48%)
  • InternVL 2.5 26B: 26B · Q4_K_M → 20 GB (48%)
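As a sanity check on this bucket, any of the 32B Q4_K_M entries should load across both cards with stock tooling; an ollama pull is the lowest-friction path (the tag below follows ollama's library naming, verify it before relying on it):

  # Ollama splits across both visible GPUs on its own once the model plus
  # cache outgrows a single card; the device list just assumes the 4090
  # enumerates as device 0.
  CUDA_VISIBLE_DEVICES=0,1 ollama run qwen2.5:32b-instruct-q4_K_M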

Borderline (12)

Fits, but with little headroom: at the listed quant the weights alone come to ~114% of the conservative 42 GB effective-VRAM figure, so KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo, and verify before deployment. Percentages below are of effective VRAM.

  • Qwen 2.5 72B Instruct: 72B · Q4_K_M → 48 GB (114%)
  • Molmo 72B: 72B · Q4_K_M → 48 GB (114%)
  • Qwen 2.5-VL 72B: 72B · AWQ-INT4 → 48 GB (114%)
  • Qwen 3 72B: 72B · AWQ-INT4 → 48 GB (114%)
  • Qwen 2.5 Math 72B: 72B · Q4_K_M → 48 GB (114%)
  • DeepSeek R1 Distill Llama 70B: 70B · Q4_K_M → 48 GB (114%)
  • OpenBioLLM Llama 3 70B: 70B · Q4_K_M → 48 GB (114%)
  • Llama 3.3 70B Instruct: 70B · Q4_K_M → 48 GB (114%)
  • Dolphin 3 Llama 3.3 70B: 70B · AWQ-INT4 → 48 GB (114%)
  • Llama 3.1 Nemotron 70B Instruct: 70B · Q4_K_M → 48 GB (114%)
  • Tulu 3 70B: 70B · Q4_K_M → 48 GB (114%)
  • Hermes 4 Llama 3.3 70B: 70B · AWQ-INT4 → 48 GB (114%)
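Making the context cap concrete, a sketch with current llama.cpp flags (filename is a placeholder; quantizing the V cache as well, via -ctv q8_0, additionally requires flash attention in your build):

  # 70B Q4_K_M on this combo: weights leave almost nothing for KV cache.
  # -c 4096 caps context; -ctk q8_0 halves the K half of the cache vs f16.
  ./llama-server -m llama-3.3-70b-instruct-q4_k_m.gguf \
      -ngl 99 -sm layer -ts 24,24 -c 4096 -ctk q8_0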

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo. Percentages below are of effective VRAM.

  • DeepSeek V4 Pro (1.6T MoE): 1600B · Q4_K_M → 1024 GB (2438%)
  • Kimi K2.6: 1000B · Q4_K_M → 700 GB (1667%)
  • Step-3: 1000B · AWQ-INT4 → 640 GB (1524%)
  • DeepSeek V4: 745B · AWQ-INT4 → 480 GB (1143%)
  • Mistral Medium 3.5 (675B MoE): 675B · Q4_K_M → 448 GB (1067%)
  • DeepSeek R1 (671B reasoning): 671B · Q4_K_M → 420 GB (1000%)
  • DeepSeek V3 (671B MoE): 671B · Q4_K_M → 420 GB (1000%)
  • Llama 4 405B: 405B · AWQ-INT4 → 280 GB (667%)
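When a model from this bucket is needed anyway, the usual escape hatch is partial offload: keep only some layers on the GPUs and accept CPU-bound speed. A sketch with stock llama.cpp flags (filename and layer count are illustrative, and the entries above would still need hundreds of GB of system RAM for the remainder):

  # Put 30 layers on the GPUs, leave the rest in system RAM. It runs, but
  # token rate collapses toward CPU/RAM-bandwidth speeds.
  ./llama-server -m big-moe-q4_k_m.gguf -ngl 30 -sm layer -ts 24,24 -c 4096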

Benchmark opportunities

Estimates, not measurements.

Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.

  • llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090): pending · estimate 10-16 tok/s (asymmetric layer-split)

MoE on a mixed-GPU rig via llama.cpp layer-split. Mixtral 8x22B (39B active, 141B total) at Q4_K_M is ~80 GB, which does NOT fit on dual 24 GB cards even with layer-split; scaling that footprint down, Q3_K_M still lands around 60-65 GB, so expect partial offload or an even smaller quant. Marked pending to verify quant fit before measurement.
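A quick way to settle the quant-fit question before spending benchmark time is to sum shard sizes on disk; the repo name below is hypothetical:

  # Pull only the Q3_K_M shards and total them.
  huggingface-cli download someorg/Mixtral-8x22B-Instruct-GGUF \
      --include "*Q3_K_M*" --local-dir ./mixtral-8x22b
  du -ch ./mixtral-8x22b/*.gguf | tail -1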

Going deeper

  • Full combo detail page — operational review with failure modes and runtime matrix.
  • Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
  • Will-it-run home — single-card check + custom builds.