Running local AI on multiple GPUs in 2026
Multi-GPU local AI is the most-Googled, least-honestly-answered topic in the space. This guide covers the math (effective vs total VRAM), the runtime mechanics (tensor vs pipeline parallelism), the recommended builds at each price tier, and the failure modes nobody warns you about.
The myth of pooled VRAM
The single most common misconception in multi-GPU local AI: two 24 GB cards do not give you 48 GB to load a single model into. Total VRAM is not pooled VRAM. Reading benchmarks and tutorials, you'll see “dual 3090, 48 GB total” framed as if those are equivalent, and they're not.
Here's what actually happens. A 70B model at Q4 quantization is roughly 40 GB of weights. On dual RTX 3090 with vLLM, the runtime distributes the weights across the two cards via tensor parallelism — card 0 holds the first half of every layer's weight tensors, card 1 holds the second half. On every forward pass, the activations cross the bus (NVLink if present, PCIe otherwise) so each card can compute its share.
Each card now holds about 20 GB of model weights. But each card also needs its own slice of the KV cache (context-dependent; can be 4-8 GB per card at 32K context for a 70B model), its own activation buffers (1-2 GB), and runtime overhead (1-2 GB). So effective per-card capacity is 24 GB minus ~2-3 GB of fixed overhead, roughly 21-22 GB usable for weights and KV cache. Total effective is roughly 42-44 GB, not 48.
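To make that arithmetic concrete, here is a minimal sketch reproducing the rough budget above; the overhead figure is the editorial estimate from this section, not a measurement.

```python
# Rough effective-VRAM arithmetic for a symmetric dual-24GB rig, using the estimates above.
cards = 2
vram_per_card_gb = 24.0
fixed_overhead_per_card_gb = 2.5   # activation buffers + runtime overhead, editorial estimate

usable_per_card = vram_per_card_gb - fixed_overhead_per_card_gb   # ~21.5 GB for weights + KV cache
effective_total = cards * usable_per_card                          # ~43 GB, not 48

weights_70b_q4_gb = 40.0
kv_budget = effective_total - weights_70b_q4_gb                    # only a few GB left for KV cache
print(f"effective: {effective_total:.0f} GB, KV-cache budget: {kv_budget:.0f} GB")
# At 70B Q4, most of the effective capacity goes to weights; the KV-cache budget is
# what limits context length.
```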
The exceptions where total ≈ effective:
- Apple unified memory. The Mac Studio M3 Ultra 192 GB genuinely shares one memory pool between CPU, GPU, and Neural Engine. A 110 GB model fits in the 192 GB pool with room for context.
- NVLink-Switch fabric. DGX-class chassis with H100 SXM modules use a 900 GB/s mesh that approximates pooled behavior — tensor parallelism cost approaches zero.
PCIe + optional NVLink between consumer cards (3090 with bridge, 4090 without) is fundamentally different from these pooled architectures. You get capacity but you pay for it in cross-card communication.
Total VRAM vs effective VRAM
Every hardware-combo page on this site shows both numbers explicitly. Here's how to read them:
- Total VRAM = sum of all GPU VRAM in the system. This is the marketing number on every spec sheet.
- Effective VRAM = the largest model size (in weights) you can actually load with reasonable context, after runtime overhead.
As a rule of thumb for symmetric NVIDIA multi-GPU: effective ≈ total × 0.92. For mixed cards: effective ≈ total × 0.85 (more overhead from asymmetric weight shards). For Apple unified memory: effective ≈ total × 0.73 (macOS reserves part of the pool for the system). For multi-machine clusters over Thunderbolt or Ethernet: effective ≈ total × 0.70 (cluster overhead, plus KV-cache duplication on some configurations).
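A small helper encoding those multipliers, if you want to sanity-check a build before buying; the factors are the editorial rules of thumb above, not measured constants.

```python
# Rule-of-thumb effective VRAM for a planned build. Multipliers are the editorial
# estimates from this guide, not guarantees.
FACTORS = {
    "nvidia_symmetric": 0.92,   # matched NVIDIA cards, single chassis
    "nvidia_mixed": 0.85,       # asymmetric cards, extra shard overhead
    "apple_unified": 0.73,      # macOS reserves part of the unified pool
    "cluster": 0.70,            # Thunderbolt/Ethernet clusters, duplication overhead
}

def effective_vram_gb(total_gb: float, kind: str) -> float:
    return total_gb * FACTORS[kind]

print(effective_vram_gb(48, "nvidia_symmetric"))   # dual 3090 -> ~44 GB
print(effective_vram_gb(192, "apple_unified"))     # Mac Studio M3 Ultra -> ~140 GB
```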
Tensor parallelism vs pipeline parallelism
Two fundamentally different ways to split a model across GPUs.
Tensor parallelism
Each layer is split across cards. Card 0 holds the first half of every weight matrix, card 1 the second half. On every forward pass, both cards compute their half, then perform an all-reduce to combine results. The all-reduce traffic is roughly proportional to model hidden size × batch size × tokens.
Tensor parallelism is what vLLM and SGLang default to with --tensor-parallel-size N. It's latency-friendly when interconnect is fast (NVLink, NVLink-Switch); painful when interconnect is slow (PCIe-only on dual 4090, Thunderbolt on Mac cluster, Ethernet on multi-node).
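Here is a minimal sketch of that default via vLLM's Python API; the model path is a hypothetical placeholder and the context limit is just an example, not a recommendation.

```python
# Minimal vLLM tensor-parallel sketch for a dual-GPU rig.
# Model path and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.3-70b-instruct-awq",  # hypothetical local path
    tensor_parallel_size=2,                      # split every layer's weights across 2 GPUs
    max_model_len=8192,                          # keep the KV cache within per-card headroom
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same split is available from the command line via --tensor-parallel-size when serving.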
Pipeline parallelism (a.k.a. layer split)
Whole layers go on different cards. Card 0 handles layers 0-39, card 1 handles layers 40-79. On every forward pass, the activation tensor crosses the bus once (when transitioning from card 0's last layer to card 1's first layer), not on every layer.
This is what llama.cpp calls layer split (--tensor-split). Better for asymmetric cards (mixed 4090 + 3090), better over PCIe-only, better for multi-machine. The downside: inherently sequential. Card 1 sits idle while card 0 is computing its layers, so single-stream throughput is limited to roughly what a single card could deliver, not the aggregate. The parallelism you do win shows up as higher concurrent throughput when multiple requests are pipelined, not as lower latency.
Pick by interconnect
- NVLink / NVLink-Switch / unified memory → tensor parallelism wins
- PCIe-only / Thunderbolt / Ethernet → pipeline parallelism (layer split) wins
- Mixed cards → pipeline parallelism, always
NVLink vs PCIe — the bandwidth cliff
NVLink 3.0 between two RTX 3090s provides ~112 GB/s of total bidirectional bandwidth, roughly 56 GB/s in each direction. PCIe 4.0 x8 provides ~16 GB/s in each direction. That's roughly a 3.5× difference in cross-card bandwidth, and tensor-parallel workloads feel it directly.
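For intuition, a back-of-the-envelope sketch of per-token all-reduce traffic at tensor-parallel-2; the hidden size and layer count are typical 70B-class values assumed here for illustration.

```python
# Back-of-the-envelope cross-card traffic for tensor-parallel-2 decode on a 70B-class model.
# hidden_size and layer count are typical values for this class, assumed for illustration only.
hidden_size = 8192
layers = 80
allreduces_per_layer = 2      # Megatron-style TP: one after attention, one after the MLP
bytes_per_elem = 2            # FP16 activations
batch = 1

bytes_per_token = hidden_size * bytes_per_elem * layers * allreduces_per_layer * batch  # ~2.6 MB

for name, gbps in [("NVLink 3.0, ~56 GB/s per direction", 56),
                   ("PCIe 4.0 x8, ~16 GB/s per direction", 16)]:
    us = bytes_per_token / (gbps * 1e9) * 1e6
    print(f"{name}: ~{us:.0f} microseconds of pure transfer per token")

# Pure transfer time is small at batch 1; per-all-reduce synchronization latency (not modeled
# here) is what hurts over PCIe, and the transfer term grows linearly with batch size.
```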
Why this matters operationally:
- Dual 3090 with NVLink: tensor parallelism extracts close to theoretical throughput. 70B Q4 decode at 25-30 tok/s on vLLM.
- Dual 4090 (no NVLink): tensor parallelism over PCIe loses ~10-20% throughput vs theoretical. 70B Q4 decode at 28-35 tok/s — faster than 3090 due to newer architecture, but not as much faster as the spec gap suggests.
- Quad 3090 with paired NVLink (cards 0-1, cards 2-3): tensor-parallel-4 still works but the cross-pair links go over PCIe. Effective throughput per stream is ~20-25 tok/s on 70B; aggregate concurrent throughput scales linearly.
- 4× H100 SXM with NVLink-Switch: 900 GB/s mesh. Tensor-parallel-4 is essentially free. This is what production deployments target.
The RTX 4090 NVLink absence is the single most-misunderstood spec gap in the consumer GPU market. NVIDIA removed it; many tutorials still describe dual 4090 setups as if NVLink were present.
Dual 3090 vs single 4090 vs single 5090 — the most-asked question
The honest framing depends entirely on whether your target model fits in 32 GB at Q4.
| Setup | VRAM | Power | Largest Q4 model | Cost (used 3090 / new) |
|---|---|---|---|---|
| Dual 3090 (NVLink) | 48 GB total / ~44 effective | 700W | 70B | $1,200-1,800 |
| Single 4090 | 24 GB | 450W | 32B | $1,500-2,000 |
| Single 5090 | 32 GB | 575W | 35B | $2,000-2,500 |
| Dual 4090 (no NVLink) | 48 GB total / ~44 effective | 900W | 70B | $4,000-5,000 |
If your peak workload is 32B class: single 5090 is the operator default. Simpler, quieter, less power, newer architecture, full warranty.
If you need 70B class: dual 3090 with NVLink is the cheapest path. Two used 3090s plus an NVLink bridge come in at roughly a third of the cost of dual 4090, with comparable 70B decode throughput. Dual 4090 only justifies the premium when you also need its FP8 throughput (some quant formats) or you require new cards under warranty.
Detail pages: dual RTX 3090, dual RTX 4090, mixed 4090 + 3090.
Going beyond two cards
Three or more GPUs in one chassis is where things get interesting. Use cases:
- 100B-class dense models that don't fit in 48 GB
- High-concurrency 70B serving (8+ users)
- MoE models with high expert count
See the quad RTX 3090 build for the prosumer-ceiling configuration: 96 GB total / ~88 GB effective at $3,000-4,500. Above this, you're paying datacenter prices (H100 80 GB at $20k+ used) or distributing across multiple machines.
llama.cpp layer split — the asymmetric path
When pair-matching cards isn't an option (you already own a 3090 and want to add a 4090, or you're building cheap from used parts), llama.cpp's layer-split mode is the only reasonable runtime.
Configuration is via --tensor-split "ratio0,ratio1,...". The ratios should approximate the relative throughput of each card. For mixed RTX 4090 + RTX 3090: try --tensor-split 1.2,1.0 as a starting point (4090 takes slightly more layers because it's faster). Benchmark and adjust.
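A minimal sketch of the same split via llama-cpp-python (the Python bindings around llama.cpp); the model path is a placeholder and the 1.2:1.0 ratio is just the starting point above, to be tuned by benchmarking.

```python
# Layer-split sketch for a mixed 4090 + 3090 box via llama-cpp-python.
# Model path and split ratio are illustrative assumptions to adjust for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,              # offload every layer to the GPUs
    tensor_split=[1.2, 1.0],      # GPU 0 (4090) takes more layers than GPU 1 (3090)
    n_ctx=8192,
)

out = llm("Q: What does layer split do? A:", max_tokens=128)
print(out["choices"][0]["text"])
```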
Performance tradeoff: layer split is sequential by design. Single-stream throughput lands somewhere between the two cards' solo full-model throughputs, a weighted average determined by how the layers are split, minus transfer overhead. You gain capacity, not single-stream speed.
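A quick sketch of that estimate under a simple bandwidth-bound model; the solo throughput numbers are illustrative, not measured.

```python
# Rough single-stream estimate for layer split across mismatched cards, under a simple
# memory-bandwidth-bound model. Solo throughputs are illustrative placeholders.
solo_fast = 25.0   # tok/s if the faster card could hold the whole model
solo_slow = 15.0   # tok/s if the slower card could hold the whole model
frac_fast = 0.55   # fraction of layers assigned to the faster card

time_per_token = frac_fast / solo_fast + (1 - frac_fast) / solo_slow   # sequential stages
print(f"~{1 / time_per_token:.1f} tok/s single-stream")  # lands between the two solo numbers
```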
CPU offload — the last-resort path
When your GPU(s) can't hold the whole model, you can offload some layers to system RAM and have the CPU compute them. Both vLLM and llama.cpp support this.
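A minimal sketch of partial offload via llama-cpp-python, assuming an 80-layer 70B model; the layer count is an illustrative guess to tune against your VRAM. vLLM exposes a similar knob (--cpu-offload-gb).

```python
# CPU-offload sketch via llama-cpp-python: put only as many layers on the GPU as fit,
# and let the rest run on the CPU. The layer count is an illustrative guess, not a recipe.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=56,   # ~70% of an 80-layer model on the GPU; remaining layers run on CPU
    n_ctx=4096,
)
```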
The throughput penalty is real and painful. CPU compute on dense LLM layers is 5-15× slower per layer than even mid-tier GPUs. A 70B model with 30% layers on CPU drops from 25 tok/s to 3-5 tok/s decode. CPU offload is the “model technically runs” option, not the operational deployment option.
When CPU offload makes sense: testing whether a model is worth investing more hardware in. Running a model occasionally for batch tasks where latency doesn't matter. Rarely otherwise.
Apple unified memory vs NVIDIA multi-GPU
Apple Silicon's unified-memory architecture is genuinely different from NVIDIA's discrete-GPU model. The Mac Studio M3 Ultra 192 GB has one pool of 800 GB/s memory shared between CPU, GPU, and Neural Engine. There's no separate VRAM. A 110 GB model just sits in the pool with the rest of macOS.
The tradeoffs:
- Apple wins on largest-model envelope. 192 GB of unified memory beats quad RTX 3090 (~88 GB effective) and far exceeds a single H100's 80 GB. Mac Studio M4 Ultra (256 GB) beats everything short of a multi-H100 cluster.
- Apple wins on power and noise. 370W under sustained load vs 1400W for quad 3090. Near-silent vs server-fan loud.
- NVIDIA wins on tokens-per-second. Apple's 800 GB/s of unified-memory bandwidth is respectable (roughly 80% of a single RTX 4090's ~1 TB/s), but per-stream decode on Apple Silicon typically lands at ~25-30% of a comparable NVIDIA rig once compute throughput and software maturity are factored in.
- NVIDIA wins on ecosystem. CUDA, vLLM, SGLang, TensorRT-LLM, every quantization format, every fine-tuning recipe. MLX is good but narrower.
Pick Apple when largest-model-that-runs matters more than tok/s, when you can live without the CUDA ecosystem, and when your environment values quiet and low power. Pick NVIDIA multi-GPU when throughput per dollar matters most, when you need CUDA tooling, and when you're willing to manage the operational complexity.
Detail pages: Mac Studio M3 Ultra 192GB, quad Mac Mini Exo cluster.
Multi-machine inference: Exo, Hyperspace, Ray Serve
When even a single Mac Studio M3 Ultra 192 GB or quad RTX 3090 isn't enough, you distribute across multiple machines. Four common paths:
Exo — multi-Mac clusters
Exo shards a model across multiple Apple Silicon Macs over Thunderbolt 4/5. The 4× Mac Mini M4 Pro cluster (~256 GB total, ~180 GB effective) fits 200B-class MoE at the smallest practical price point. Tradeoff: Thunderbolt latency makes it 3-5× slower per stream than a single Mac Studio.
Ray Serve — replica orchestration
Ray Serve doesn't pool capacity for a single model. Instead it orchestrates multiple replicas across machines, each replica holding a full copy of the model. The use case is high-concurrency serving: 4 nodes × 2× RTX 4090 each = 4 replicas of a 70B model, serving aggregate 4× the requests. See the Ray Serve distributed combo.
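A minimal hedged sketch of that replica pattern with Ray Serve's Python API; the model path, replica count, and GPU counts are illustrative, and placement-group details for running vLLM's tensor parallelism inside Ray actors are glossed over.

```python
# Sketch: 4 replicas of a tensor-parallel-2 vLLM model behind Ray Serve.
# Each replica holds a full copy of the model; aggregate request throughput scales with replicas.
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
class Llama70B:
    def __init__(self):
        # Every replica loads its own copy, tensor-parallel across its 2 GPUs.
        self.llm = LLM(model="/models/llama-3.3-70b-awq", tensor_parallel_size=2)  # hypothetical path
        self.params = SamplingParams(max_tokens=256)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        out = self.llm.generate([prompt], self.params)
        return {"text": out[0].outputs[0].text}

app = Llama70B.bind()
# serve.run(app)  # launches the 4 replicas across the Ray cluster and exposes an HTTP endpoint
```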
SGLang distributed
SGLang's distributed mode shards the model across nodes. Use when you need a model larger than any single machine can hold. The counterpart to Ray Serve's replica pattern: SGLang scales model size, Ray Serve scales request throughput.
vLLM tensor-parallel multi-node
vLLM supports multi-node tensor parallelism, but it requires high-bandwidth interconnect (InfiniBand, 100GbE) and careful network tuning. Practical only at datacenter tier.
When multi-GPU is worth it (and when it isn't)
Multi-GPU adds operational complexity. Decide based on:
- Largest model you actually need. If 32B Q4 covers your workload, single 5090 wins. If you need 70B+, multi-GPU starts paying off.
- Concurrency you actually need. If solo developer, multi-GPU is overkill. If team or production, replica orchestration (Ray Serve) compounds.
- Operational tolerance. Multi-GPU debugging is real work. PCIe lanes, NUMA placement, thermals, network — none are free.
- Power and acoustic budget. 1400W of GPU draw isn't living-room compatible.
Recommended builds at each tier
Hobby / solo developer ($2,000-3,000)
Single RTX 5090 + ATX rig. Covers 32B-class workloads. Skip multi-GPU.
Prosumer / power user ($1,500-2,500)
Dual RTX 3090 with NVLink. Used 3090s + bridge + 1000W PSU. Extracts 70B-class at the lowest viable price. See the dual 3090 detail page.
Quiet / largest-model focus ($7,000-10,000)
Mac Studio M3 Ultra 192 GB. 200B-class envelope, near-silent, 370W power. CUDA ecosystem absence is the catch. Detail page.
Homelab / serious throughput ($3,500-5,000)
Quad RTX 3090 in an open-frame or server chassis. 96 GB total / ~88 GB effective. The prosumer ceiling. Plan for basement / server-room placement. Detail page.
Production datacenter ($200,000+)
4× H100 SXM with NVLink-Switch. The reference vLLM tensor-parallel deployment for frontier-MoE serving. Detail page. Most teams should rent (RunPod, Lambda, CoreWeave) before owning at this tier.
Failure modes nobody warns you about
- NVLink bridge looks seated but isn't. NVLink 3.0 bridges are mechanically fragile. A bridge that “clicked” but isn't fully engaged silently falls back to PCIe. Always verify with nvidia-smi nvlink --status before benchmarking.
- PCIe lane starvation on consumer boards. Most consumer platforms split the CPU's x16 into x8/x8 when both slots are populated. Workstation boards (TRX50, W790) maintain x16 per slot but cost real money.
- Power transient overload. Two RTX 4090s can spike to 600W each momentarily. A 1200W PSU sized for “2× 450W = 900W” can trip OCP during boost transients. Size for 1500W+ of headroom.
- Memory-junction throttling on inner cards. Cards sandwiched between others (slots 2-3 in a quad-card build) hit memory-junction temps over 105°C. Inter-card cooling is mandatory; nvidia-smi reports core temp, not junction temp.
- Driver / runtime version mismatch. CUDA, PyTorch, vLLM, Transformer Engine: every version pin matters. Production deployments need explicit version management.
- NUMA placement. On dual-socket builds, GPUs need to be on the same NUMA node as the inference process. Mismatches cost 30%+ throughput silently.
- macOS background processes. Spotlight, Time Machine, and Photos all compete for memory and CPU. Production Apple-Silicon deployments must disable them.
Bottom line
Multi-GPU is not a side feature of local AI — it's the dividing line between “I can run a 32B model” and “I can run a 70B+ model.” The hardware combinations on this site exist to answer the second question honestly, with effective-VRAM math, runtime compatibility, and failure modes attached.
If you take only one thing away from this guide: total VRAM is not pooled VRAM. Two cards do not make one bigger card, except on Apple unified memory and NVLink-Switch. Buy your hardware combination with that math in mind, and the rest follows.
Benchmark opportunities
Pending benchmark targets across the multi-GPU combinations on this site. These are editorial estimates — when measured, results land in the catalog as benchmarks. Tracking them here keeps the gap visible and shows operators which configurations need verification.
Single-card baseline for the next-gen consumer flagship. The comparison anchor for all multi-GPU benchmarks: 'how much does multi-GPU actually buy you over the best single card?'
Why this matters: The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.
Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and dramatic concurrency lift via SGLang RadixAttention.
Why this matters: The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.
Apple Silicon at the frontier-MoE envelope. 17B-active makes this fit comfortably in 192GB unified memory. Bandwidth-bound; expect ~25-30% of NVIDIA tok/s but largest fittable model wins.
Why this matters: The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Editorial estimate is 25-30% of NVIDIA throughput; measured value would close the loop.
iPhone 16 Pro on-device benchmark. Apple's MLX Swift example apps demonstrate this approximate range; needs first-party verification with thermal/battery instrumentation.
Why this matters: Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'
DeepSeek V4 Flash is the throughput-tuned V4 sibling. 80B/12B-active on 4× H100 should produce the strongest open-weight tok/s of 2026.
Why this matters: DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.
Reference benchmark for the dual-3090 NVLink prosumer build. vLLM tensor-parallel-2, AWQ-INT4, 8K context. Compare against dual-4090 PCIe (no NVLink) to isolate interconnect impact.
Why this matters: The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.
Reference benchmark for dual-4090 PCIe (no NVLink). Same model + quant as dual-3090 entry; the comparison reveals NVLink vs PCIe impact at tensor-parallel-2.
Why this matters: Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090 NVLink absence is the single most-misunderstood spec gap; a measured comparison ends the speculation.
Snapdragon 8 Elite Hexagon NPU running Qualcomm-published Phi-3.5 Mini. Vendor benchmarks suggest this range; needs independent verification with sustained-load + thermal-throttle measurement.
Why this matters: Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
Reasoning workload on quad-3090. R1 distill produces 5-15× more tokens per query; per-stream throughput drops vs same-size non-reasoning model.
Why this matters: Quad-3090 is the prosumer-ceiling stack. R1 reasoning workloads are a high-traffic use case but the thinking-mode token bloat changes the throughput calculation — needed to set realistic operator expectations.
Multi-Mac Exo cluster. 70B at MLX-4bit (~40GB) shards across 4 nodes; Thunderbolt 5 latency dominates. Compare against single Mac Studio M3 Ultra to quantify cluster overhead.
Why this matters: Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
Ray Serve replica orchestration. Each replica runs vLLM tensor-parallel-2; 4 replicas = 4 parallel serving paths. Measure aggregate throughput vs concurrency scan.
Why this matters: Distributed-serving patterns differ from tensor-parallel — replicas scale aggregate concurrency, not single-stream model size. The concurrency scan reveals where Ray Serve replicas plateau.
Copilot+ PC reference SoC. Phi-3.5 Mini via ONNX Runtime DirectML NPU provider. Microsoft's published benchmarks for Phi Silica suggest this range; cross-vendor independent verification needed.
Why this matters: Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the operator decision for Windows on-device deployments.
Intel NPU 4 with OpenVINO providing the inference path. Intel's published numbers + Microsoft's Phi Silica benchmarks suggest this range. Independent verification needed.
Why this matters: Lunar Lake is the Intel reference for Copilot+ PCs. Comparison vs Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
iPad M4 has the highest mobile-class memory bandwidth (120 GB/s) and a 38-TOPS NPU. Cold-start tok/s is well-published; the operator-relevant question is the sustained-load thermal-throttle curve over 30+ min.
Why this matters: Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.
Snapdragon Adreno GPU path via MLC LLM (vs Hexagon NPU via Qualcomm AI Hub). Comparing the same SoC's GPU vs NPU path is the operator decision: NPU wins on power, GPU on flexibility.
Why this matters: MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
MoE on a mixed-GPU rig via llama.cpp layer-split. Mixtral 8x22B (39B-active, 141B total) at Q4_K_M is ~80GB — does NOT fit on dual 24GB cards even with layer-split. Q3_K_M might fit. Marked pending to verify quant fit before measurement.
Why this matters: Research-tier benchmark — mixed-GPU is editorial-discouraged for new builds. Useful for users who already own asymmetric pairs.
Frequently asked
Does adding a second GPU double my VRAM?
No. Two 24 GB cards do not give you 48 GB to load a single model into. Each card holds its share of the model via tensor parallelism (vLLM/SGLang) or pipeline parallelism (llama.cpp layer split). Runtime overhead, KV cache, and activations consume per-card VRAM, so effective capacity is roughly the sum minus 2-3 GB per card. The exception is Apple unified memory and NVLink-Switch fabrics — those genuinely pool. PCIe-connected NVIDIA cards do not.
Is dual RTX 3090 better than a single RTX 5090 for local AI?
It depends on whether your target model fits in 32 GB. If it does (32B Q4, 14B FP16, etc.), single 5090 wins on power, simplicity, support life, and performance per dollar. If you need 70B-class (Q4 ~40 GB), dual 3090 with NVLink is the cheapest path to that envelope. Single 5090 with CPU offload technically runs 70B but throughput collapses by 5-10×.
Why doesn't the RTX 4090 have NVLink?
NVIDIA removed it. The 4090 has zero NVLink hardware. Two 4090s communicate only over PCIe, which means cross-card bandwidth is roughly 32 GB/s vs 112 GB/s on dual 3090 NVLink. For tensor parallelism this matters — expect 10-20% throughput penalty vs an NVLink-equipped pair. Many tutorials assume NVLink and produce misleading benchmarks for 4090 builds.
When should I choose Apple Silicon over multi-GPU NVIDIA?
When your priority is largest-model-that-fits at the lowest power and noise budget. Mac Studio M3 Ultra 192GB runs 200B-class models that don't fit on quad RTX 3090, at a quarter of the power draw, near-silent. The tradeoff: 25-30% the tokens-per-second of comparable NVIDIA, no CUDA ecosystem, smaller quantization toolchain. For research and largest-model envelope: Apple. For maximum throughput per dollar: NVIDIA multi-GPU.
What's the difference between tensor parallelism and pipeline parallelism?
Tensor parallelism splits each layer across GPUs — every card holds half the weights of every layer, and activations cross the bus on every forward pass. Latency-friendly with NVLink, painful over PCIe-only. Pipeline parallelism (llama.cpp 'layer split') puts whole layers on different cards — card 1 does layers 0-39, card 2 does layers 40-79. Cross-card traffic is once per token, not every layer. Better for asymmetric setups and PCIe-only rigs; worse for tight latency.
Can I mix a 4090 and a 3090 in the same system?
Technically yes; operationally inadvisable. llama.cpp layer-split accepts asymmetric ratios and works. vLLM tensor-parallel assumes symmetric cards and underperforms. Most importantly, you waste the 4090's compute waiting for the 3090 every layer. If you already own one of the cards, layer-split via llama.cpp is acceptable. If building from scratch, pair-match instead.
What's an Exo cluster and when does it make sense?
Exo distributes a single model across multiple Mac machines, sharding layers via Thunderbolt 4/5. A 4× Mac Mini M4 Pro Exo cluster gives 256 GB total memory. The use case: 200B+ class models that don't fit in a single Mac Studio M3 Ultra 192GB. The tradeoffs: Thunderbolt latency makes it 3-5× slower than the single Mac Studio, and operational complexity is significant. Only worth it when you genuinely need >192 GB unified memory.
Is multi-GPU worth it for hobby or solo-developer use?
Usually not, unless you specifically need 70B+ models. The operational complexity of multi-GPU (PCIe lane management, thermal management, runtime configuration, debugging) is real. For most hobby use cases, a single RTX 4090 or 5090 covers 32B-class workloads cleanly. Multi-GPU starts paying off when (a) you need a model that doesn't fit on one card, or (b) you're serving multiple concurrent users.
Going deeper
- Hardware combinations index — every combo with operator-grade detail.
- Distributed inference systems — architectural depth.
- Choosing a GPU for local AI in 2026 — single-card buying guide companion.
- Execution stacks — deployment recipes with components, runtime, and benchmarks.
The landscape has shifted significantly over the past two years. Splitting a model across two RTX 3090s gives you 48 GB of total VRAM (roughly 44 GB effective), which unlocks 70B-parameter models at 4-bit quantization with room for a usable context window, a configuration that was confined to expensive professional cards just a generation ago. The trade-off is power draw, case real estate, and PCIe configuration complexity that a single card like the RTX 4090 sidesteps entirely. Understanding which path fits your physical build is half the decision before you even compare numbers.
The closest single-card comparison worth studying: RTX 4090 vs dual RTX 3090 comparison.