Running local AI on multiple GPUs in 2026
Multi-GPU local AI is the most-Googled, least-honestly-answered topic in the space. This guide covers the math (effective vs total VRAM), the runtime mechanics (tensor vs pipeline parallelism), the recommended builds at each price tier, and the failure modes nobody warns you about.
The myth of pooled VRAM
The single most common misconception in multi-GPU local AI: two 24 GB cards do not give you 48 GB to load a single model into. Total VRAM is not pooled VRAM. Reading benchmarks and tutorials, you'll see “dual 3090, 48 GB total” framed as if those are equivalent, and they're not.
Here's what actually happens. A 70B model at Q4 quantization is roughly 40 GB of weights. On dual RTX 3090 with vLLM, the runtime distributes the weights across the two cards via tensor parallelism — card 0 holds the first half of every layer's weight tensors, card 1 holds the second half. On every forward pass, the activations cross the bus (NVLink if present, PCIe otherwise) so each card can compute its share.
Each card now holds about 20 GB of model weights. But each card also needs its own slice of the KV cache (context-dependent; can be 4-8 GB per card at 32K context for a 70B model), its own activation buffers (1-2 GB), and runtime overhead (1-2 GB). So effective per-card capacity is 24 GB minus ~2-3 GB of fixed overhead, roughly 21-22 GB usable for weights and KV cache. Total effective is roughly 42-44 GB, not 48.
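To make that arithmetic concrete, here is a minimal sketch reproducing the rough budget above; the overhead figure is the editorial estimate from this section, not a measurement.

```python
# Rough effective-VRAM arithmetic for a symmetric dual-24GB rig, using the estimates above.
cards = 2
vram_per_card_gb = 24.0
fixed_overhead_per_card_gb = 2.5   # activation buffers + runtime overhead, editorial estimate

usable_per_card = vram_per_card_gb - fixed_overhead_per_card_gb   # ~21.5 GB for weights + KV cache
effective_total = cards * usable_per_card                          # ~43 GB, not 48

weights_70b_q4_gb = 40.0
kv_budget = effective_total - weights_70b_q4_gb                    # only a few GB left for KV cache
print(f"effective: {effective_total:.0f} GB, KV-cache budget: {kv_budget:.0f} GB")
# At 70B Q4, most of the effective capacity goes to weights; the KV-cache budget is
# what limits context length.
```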
The exceptions where total ≈ effective:
- Apple unified memory. The Mac Studio M3 Ultra 192 GB genuinely shares one memory pool between CPU, GPU, and Neural Engine. A 110 GB model fits in the 192 GB pool with room for context.
- NVLink-Switch fabric. DGX-class chassis with H100 SXM modules use a 900 GB/s mesh that approximates pooled behavior — tensor parallelism cost approaches zero.
PCIe + optional NVLink between consumer cards (3090 with bridge, 4090 without) is fundamentally different from these pooled architectures. You get capacity but you pay for it in cross-card communication.
Total VRAM vs effective VRAM
Every hardware-combo page on this site shows both numbers explicitly. Here's how to read them:
- Total VRAM = sum of all GPU VRAM in the system. This is the marketing number on every spec sheet.
- Effective VRAM = the largest model size (in weights) you can actually load with reasonable context, after runtime overhead.
As a rule of thumb for symmetric NVIDIA multi-GPU: effective ≈ total × 0.92. For mixed cards: effective ≈ total × 0.85 (more overhead from asymmetric weight shards). For Apple unified memory: effective ≈ total × 0.73 (macOS reserves part of the pool for the system). For multi-machine clusters over Thunderbolt or Ethernet: effective ≈ total × 0.70 (cluster overhead, plus KV-cache duplication on some configurations).
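A small helper encoding those multipliers, if you want to sanity-check a build before buying; the factors are the editorial rules of thumb above, not measured constants.

```python
# Rule-of-thumb effective VRAM for a planned build. Multipliers are the editorial
# estimates from this guide, not guarantees.
FACTORS = {
    "nvidia_symmetric": 0.92,   # matched NVIDIA cards, single chassis
    "nvidia_mixed": 0.85,       # asymmetric cards, extra shard overhead
    "apple_unified": 0.73,      # macOS reserves part of the unified pool
    "cluster": 0.70,            # Thunderbolt/Ethernet clusters, duplication overhead
}

def effective_vram_gb(total_gb: float, kind: str) -> float:
    return total_gb * FACTORS[kind]

print(effective_vram_gb(48, "nvidia_symmetric"))   # dual 3090 -> ~44 GB
print(effective_vram_gb(192, "apple_unified"))     # Mac Studio M3 Ultra -> ~140 GB
```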
Tensor parallelism vs pipeline parallelism
Two fundamentally different ways to split a model across GPUs.
Tensor parallelism
Each layer is split across cards. Card 0 holds the first half of every weight matrix, card 1 the second half. On every forward pass, both cards compute their half, then perform an all-reduce to combine results. The all-reduce traffic is roughly proportional to model hidden size × batch size × tokens.
Tensor parallelism is what vLLM and SGLang default to with --tensor-parallel-size N. It's latency-friendly when interconnect is fast (NVLink, NVLink-Switch); painful when interconnect is slow (PCIe-only on dual 4090, Thunderbolt on Mac cluster, Ethernet on multi-node).
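Here is a minimal sketch of that default via vLLM's Python API; the model path is a hypothetical placeholder and the context limit is just an example, not a recommendation.

```python
# Minimal vLLM tensor-parallel sketch for a dual-GPU rig.
# Model path and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.3-70b-instruct-awq",  # hypothetical local path
    tensor_parallel_size=2,                      # split every layer's weights across 2 GPUs
    max_model_len=8192,                          # keep the KV cache within per-card headroom
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same split is available from the command line via --tensor-parallel-size when serving.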
Pipeline parallelism (a.k.a. layer split)
Whole layers go on different cards. Card 0 handles layers 0-39, card 1 handles layers 40-79. On every forward pass, the activation tensor crosses the bus once (when transitioning from card 0's last layer to card 1's first layer), not on every layer.
This is what llama.cpp calls layer split (--tensor-split). Better for asymmetric cards (mixed 4090 + 3090), better over PCIe-only, better for multi-machine. The downside: inherently sequential. Card 1 sits idle while card 0 is computing its layers, so single-stream throughput is limited to roughly what a single card could deliver, not the aggregate. The parallelism you do win shows up as higher concurrent throughput when multiple requests are pipelined, not as lower latency.
Pick by interconnect
- NVLink / NVLink-Switch / unified memory → tensor parallelism wins
- PCIe-only / Thunderbolt / Ethernet → pipeline parallelism (layer split) wins
- Mixed cards → pipeline parallelism, always
NVLink vs PCIe — the bandwidth cliff
NVLink 3.0 between two RTX 3090s provides ~112 GB/s of total bidirectional bandwidth, roughly 56 GB/s in each direction. PCIe 4.0 x8 provides ~16 GB/s in each direction. That's roughly a 3.5× difference in cross-card bandwidth, and tensor-parallel workloads feel it directly.
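For intuition, a back-of-the-envelope sketch of per-token all-reduce traffic at tensor-parallel-2; the hidden size and layer count are typical 70B-class values assumed here for illustration.

```python
# Back-of-the-envelope cross-card traffic for tensor-parallel-2 decode on a 70B-class model.
# hidden_size and layer count are typical values for this class, assumed for illustration only.
hidden_size = 8192
layers = 80
allreduces_per_layer = 2      # Megatron-style TP: one after attention, one after the MLP
bytes_per_elem = 2            # FP16 activations
batch = 1

bytes_per_token = hidden_size * bytes_per_elem * layers * allreduces_per_layer * batch  # ~2.6 MB

for name, gbps in [("NVLink 3.0, ~56 GB/s per direction", 56),
                   ("PCIe 4.0 x8, ~16 GB/s per direction", 16)]:
    us = bytes_per_token / (gbps * 1e9) * 1e6
    print(f"{name}: ~{us:.0f} microseconds of pure transfer per token")

# Pure transfer time is small at batch 1; per-all-reduce synchronization latency (not modeled
# here) is what hurts over PCIe, and the transfer term grows linearly with batch size.
```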
Why this matters operationally:
- Dual 3090 with NVLink: tensor parallelism extracts close to theoretical throughput. 70B Q4 decode at 25-30 tok/s on vLLM.
- Dual 4090 (no NVLink): tensor parallelism over PCIe loses ~10-20% throughput vs theoretical. 70B Q4 decode at 28-35 tok/s — faster than 3090 due to newer architecture, but not as much faster as the spec gap suggests.
- Quad 3090 with paired NVLink (cards 0-1, cards 2-3): tensor-parallel-4 still works but the cross-pair links go over PCIe. Effective throughput per stream is ~20-25 tok/s on 70B; aggregate concurrent throughput scales linearly.
- 4× H100 SXM with NVLink-Switch: 900 GB/s mesh. Tensor-parallel-4 is essentially free. This is what production deployments target.
The RTX 4090 NVLink absence is the single most-misunderstood spec gap in the consumer GPU market. NVIDIA removed it; many tutorials still describe dual 4090 setups as if NVLink were present.
Dual 3090 vs single 4090 vs single 5090 — the most-asked question
The honest framing depends entirely on whether your target model fits in 32 GB at Q4.
| Setup | VRAM | Power | Largest Q4 model | Cost (used 3090 / new) |
|---|---|---|---|---|
| Dual 3090 (NVLink) | 48 GB total / ~44 effective | 700W | 70B | $1,200-1,800 |
| Single 4090 | 24 GB | 450W | 32B | $1,500-2,000 |
| Single 5090 | 32 GB | 575W | 35B | $2,000-2,500 |
| Dual 4090 (no NVLink) | 48 GB total / ~44 effective | 900W | 70B | $4,000-5,000 |
If your peak workload is 32B class: single 5090 is the operator default. Simpler, quieter, less power, newer architecture, full warranty.
If you need 70B class: dual 3090 with NVLink is the cheapest path. Two used 3090s plus an NVLink bridge come in at roughly a third of the cost of dual 4090, with comparable 70B decode throughput. Dual 4090 only justifies the premium when you also need its FP8 throughput (some quant formats) or you require new cards under warranty.
Detail pages: dual RTX 3090, dual RTX 4090, mixed 4090 + 3090.
Going beyond two cards
Three or more GPUs in one chassis is where things get interesting. Use cases:
- 100B-class dense models that don't fit in 48 GB
- High-concurrency 70B serving (8+ users)
- MoE models with high expert count
See the quad RTX 3090 build for the prosumer-ceiling configuration: 96 GB total / ~88 GB effective at $3,000-4,500. Above this, you're paying datacenter prices (H100 80 GB at $20k+ used) or distributing across multiple machines.
llama.cpp layer split — the asymmetric path
When pair-matching cards isn't an option (you already own a 3090 and want to add a 4090, or you're building cheap from used parts), llama.cpp's layer-split mode is the only reasonable runtime.
Configuration is via --tensor-split "ratio0,ratio1,...". The ratios should approximate the relative throughput of each card. For mixed RTX 4090 + RTX 3090: try --tensor-split 1.2,1.0 as a starting point (4090 takes slightly more layers because it's faster). Benchmark and adjust.
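A minimal sketch of the same split via llama-cpp-python (the Python bindings around llama.cpp); the model path is a placeholder and the 1.2:1.0 ratio is just the starting point above, to be tuned by benchmarking.

```python
# Layer-split sketch for a mixed 4090 + 3090 box via llama-cpp-python.
# Model path and split ratio are illustrative assumptions to adjust for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,              # offload every layer to the GPUs
    tensor_split=[1.2, 1.0],      # GPU 0 (4090) takes more layers than GPU 1 (3090)
    n_ctx=8192,
)

out = llm("Q: What does layer split do? A:", max_tokens=128)
print(out["choices"][0]["text"])
```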
Performance tradeoff: layer split is sequential by design. Single-stream throughput lands somewhere between the two cards' solo full-model throughputs, a weighted average determined by how the layers are split, minus transfer overhead. You gain capacity, not single-stream speed.
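A quick sketch of that estimate under a simple bandwidth-bound model; the solo throughput numbers are illustrative, not measured.

```python
# Rough single-stream estimate for layer split across mismatched cards, under a simple
# memory-bandwidth-bound model. Solo throughputs are illustrative placeholders.
solo_fast = 25.0   # tok/s if the faster card could hold the whole model
solo_slow = 15.0   # tok/s if the slower card could hold the whole model
frac_fast = 0.55   # fraction of layers assigned to the faster card

time_per_token = frac_fast / solo_fast + (1 - frac_fast) / solo_slow   # sequential stages
print(f"~{1 / time_per_token:.1f} tok/s single-stream")  # lands between the two solo numbers
```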
CPU offload — the last-resort path
When your GPU(s) can't hold the whole model, you can offload some layers to system RAM and have the CPU compute them. Both vLLM and llama.cpp support this.
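A minimal sketch of partial offload via llama-cpp-python, assuming an 80-layer 70B model; the layer count is an illustrative guess to tune against your VRAM. vLLM exposes a similar knob (--cpu-offload-gb).

```python
# CPU-offload sketch via llama-cpp-python: put only as many layers on the GPU as fit,
# and let the rest run on the CPU. The layer count is an illustrative guess, not a recipe.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=56,   # ~70% of an 80-layer model on the GPU; remaining layers run on CPU
    n_ctx=4096,
)
```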
The throughput penalty is real and painful. CPU compute on dense LLM layers is 5-15× slower per layer than even mid-tier GPUs. A 70B model with 30% layers on CPU drops from 25 tok/s to 3-5 tok/s decode. CPU offload is the “model technically runs” option, not the operational deployment option.
When CPU offload makes sense: testing whether a model is worth investing more hardware in. Running a model occasionally for batch tasks where latency doesn't matter. Rarely otherwise.
Apple unified memory vs NVIDIA multi-GPU
Apple Silicon's unified-memory architecture is genuinely different from NVIDIA's discrete-GPU model. The Mac Studio M3 Ultra 192 GB has one pool of 800 GB/s memory shared between CPU, GPU, and Neural Engine. There's no separate VRAM. A 110 GB model just sits in the pool with the rest of macOS.
The tradeoffs:
- Apple wins on largest-model envelope. 192 GB of unified memory beats quad RTX 3090 (~88 GB effective) and far exceeds a single H100's 80 GB. Mac Studio M4 Ultra (256 GB) beats everything short of a multi-H100 cluster.
- Apple wins on power and noise. 370W under sustained load vs 1400W for quad 3090. Near-silent vs server-fan loud.
- NVIDIA wins on tokens-per-second. Apple's 800 GB/s of unified-memory bandwidth is respectable (roughly 80% of a single RTX 4090's ~1 TB/s), but per-stream decode on Apple Silicon typically lands at ~25-30% of a comparable NVIDIA rig once compute throughput and software maturity are factored in.
- NVIDIA wins on ecosystem. CUDA, vLLM, SGLang, TensorRT-LLM, every quantization format, every fine-tuning recipe. MLX is good but narrower.
Pick Apple when largest-model-that-runs matters more than tok/s, when you can live without the CUDA ecosystem, and when your environment values quiet and low power. Pick NVIDIA multi-GPU when throughput per dollar matters most, when you need CUDA tooling, and when you're willing to manage the operational complexity.
Detail pages: Mac Studio M3 Ultra 192GB, quad Mac Mini Exo cluster.
Multi-machine inference: Exo, Hyperspace, Ray Serve
When even a single Mac Studio M3 Ultra 192 GB or quad RTX 3090 isn't enough, you distribute across multiple machines. Four common paths:
Exo — multi-Mac clusters
Exo shards a model across multiple Apple Silicon Macs over Thunderbolt 4/5. The 4× Mac Mini M4 Pro cluster (~256 GB total, ~180 GB effective) fits 200B-class MoE at the smallest practical price point. Tradeoff: Thunderbolt latency makes it 3-5× slower per stream than a single Mac Studio.
Ray Serve — replica orchestration
Ray Serve doesn't pool capacity for a single model. Instead it orchestrates multiple replicas across machines, each replica holding a full copy of the model. The use case is high-concurrency serving: 4 nodes × 2× RTX 4090 each = 4 replicas of a 70B model, serving aggregate 4× the requests. See the Ray Serve distributed combo.
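A minimal hedged sketch of that replica pattern with Ray Serve's Python API; the model path, replica count, and GPU counts are illustrative, and placement-group details for running vLLM's tensor parallelism inside Ray actors are glossed over.

```python
# Sketch: 4 replicas of a tensor-parallel-2 vLLM model behind Ray Serve.
# Each replica holds a full copy of the model; aggregate request throughput scales with replicas.
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
class Llama70B:
    def __init__(self):
        # Every replica loads its own copy, tensor-parallel across its 2 GPUs.
        self.llm = LLM(model="/models/llama-3.3-70b-awq", tensor_parallel_size=2)  # hypothetical path
        self.params = SamplingParams(max_tokens=256)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        out = self.llm.generate([prompt], self.params)
        return {"text": out[0].outputs[0].text}

app = Llama70B.bind()
# serve.run(app)  # launches the 4 replicas across the Ray cluster and exposes an HTTP endpoint
```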
SGLang distributed
SGLang's distributed mode shards the model across nodes. Use when you need a model larger than any single machine can hold. The counterpart to Ray Serve's replica pattern: SGLang scales model size, Ray Serve scales request throughput.
vLLM tensor-parallel multi-node
vLLM supports multi-node tensor parallelism, but it requires high-bandwidth interconnect (InfiniBand, 100GbE) and careful network tuning. Practical only at datacenter tier.
When multi-GPU is worth it (and when it isn't)
Multi-GPU adds operational complexity. Decide based on:
- Largest model you actually need. If 32B Q4 covers your workload, single 5090 wins. If you need 70B+, multi-GPU starts paying off.
- Concurrency you actually need. If solo developer, multi-GPU is overkill. If team or production, replica orchestration (Ray Serve) compounds.
- Operational tolerance. Multi-GPU debugging is real work. PCIe lanes, NUMA placement, thermals, network — none are free.
- Power and acoustic budget. 1400W of GPU draw isn't living-room compatible.
Recommended builds at each tier
Hobby / solo developer ($2,000-3,000)
Single RTX 5090 + ATX rig. Covers 32B-class workloads. Skip multi-GPU.
Prosumer / power user ($1,500-2,500)
Dual RTX 3090 with NVLink. Used 3090s + bridge + 1000W PSU. Extracts 70B-class at the lowest viable price. See the dual 3090 detail page.
Quiet / largest-model focus ($7,000-10,000)
Mac Studio M3 Ultra 192 GB. 200B-class envelope, near-silent, 370W power. CUDA ecosystem absence is the catch. Detail page.
Homelab / serious throughput ($3,500-5,000)
Quad RTX 3090 in an open-frame or server chassis. 96 GB total / ~88 GB effective. The prosumer ceiling. Plan for basement / server-room placement. Detail page.
Production datacenter ($200,000+)
4× H100 SXM with NVLink-Switch. The reference vLLM tensor-parallel deployment for frontier-MoE serving. Detail page. Most teams should rent (RunPod, Lambda, CoreWeave) before owning at this tier.
Failure modes nobody warns you about
- NVLink bridge looks seated but isn't. NVLink 3.0 bridges are mechanically fragile. A bridge that “clicked” but isn't fully engaged silently falls back to PCIe. Always verify with nvidia-smi nvlink --status before benchmarking.
- PCIe lane starvation on consumer boards. Most consumer platforms split the CPU's x16 into x8/x8 when both slots are populated. Workstation boards (TRX50, W790) maintain x16 per slot but cost real money.
- Power transient overload. Two RTX 4090s can spike to 600W each momentarily. A 1200W PSU sized for “2× 450W = 900W” can trip OCP during boost transients. Size for 1500W+ of headroom.
- Memory-junction throttling on inner cards. Cards sandwiched between others (slots 2-3 in a quad-card build) hit memory-junction temps over 105°C. Inter-card cooling is mandatory; nvidia-smi reports core temp, not junction temp.
- Driver / runtime version mismatch. CUDA, PyTorch, vLLM, Transformer Engine: every version pin matters. Production deployments need explicit version management.
- NUMA placement. On dual-socket builds, GPUs need to be on the same NUMA node as the inference process. Mismatches cost 30%+ throughput silently.
- macOS background processes. Spotlight, Time Machine, and Photos all compete for memory and CPU. Production Apple-Silicon deployments must disable them.
Bottom line
Multi-GPU is not a side feature of local AI — it's the dividing line between “I can run a 32B model” and “I can run a 70B+ model.” The hardware combinations on this site exist to answer the second question honestly, with effective-VRAM math, runtime compatibility, and failure modes attached.
If you take only one thing away from this guide: total VRAM is not pooled VRAM. Two cards do not make one bigger card, except on Apple unified memory and NVLink-Switch. Buy your hardware combination with that math in mind, and the rest follows.
Benchmark opportunities
Pending benchmark targets across the multi-GPU combinations on this site. These are editorial estimates — when measured, results land in the catalog as benchmarks. Tracking them here keeps the gap visible and shows operators which configurations need verification.
Single-card baseline for the next-gen consumer flagship. The comparison anchor for all multi-GPU benchmarks: 'how much does multi-GPU actually buy you over the best single card?'
Why this matters: The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.
Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and dramatic concurrency lift via SGLang RadixAttention.
Why this matters: The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.
Apple Silicon at the frontier-MoE envelope. 17B-active makes this fit comfortably in 192GB unified memory. Bandwidth-bound; expect ~25-30% of NVIDIA tok/s but largest fittable model wins.
Why this matters: The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Editorial estimate is 25-30% of NVIDIA throughput; measured value would close the loop.
iPhone 16 Pro on-device benchmark. Apple's MLX Swift example apps demonstrate this approximate range; needs first-party verification with thermal/battery instrumentation.
Why this matters: Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'
DeepSeek V4 Flash is the throughput-tuned V4 sibling. 80B/12B-active on 4× H100 should produce the strongest open-weight tok/s of 2026.
Why this matters: DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.
Reference benchmark for the dual-3090 NVLink prosumer build. vLLM tensor-parallel-2, AWQ-INT4, 8K context. Compare against dual-4090 PCIe (no NVLink) to isolate interconnect impact.
Why this matters: The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.
Reference benchmark for dual-4090 PCIe (no NVLink). Same model + quant as dual-3090 entry; the comparison reveals NVLink vs PCIe impact at tensor-parallel-2.
Why this matters: Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090 NVLink absence is the single most-misunderstood spec gap; a measured comparison ends the speculation.
Snapdragon 8 Elite Hexagon NPU running Qualcomm-published Phi-3.5 Mini. Vendor benchmarks suggest this range; needs independent verification with sustained-load + thermal-throttle measurement.
Why this matters: Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
Reasoning workload on quad-3090. R1 distill produces 5-15× more tokens per query; per-stream throughput drops vs same-size non-reasoning model.
Why this matters: Quad-3090 is the prosumer-ceiling stack. R1 reasoning workloads are a high-traffic use case but the thinking-mode token bloat changes the throughput calculation — needed to set realistic operator expectations.
Multi-Mac Exo cluster. 70B at MLX-4bit (~40GB) shards across 4 nodes; Thunderbolt 5 latency dominates. Compare against single Mac Studio M3 Ultra to quantify cluster overhead.
Why this matters: Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
Ray Serve replica orchestration. Each replica runs vLLM tensor-parallel-2; 4 replicas = 4 parallel serving paths. Measure aggregate throughput vs concurrency scan.
Why this matters: Distributed-serving patterns differ from tensor-parallel — replicas scale aggregate concurrency, not single-stream model size. The concurrency scan reveals where Ray Serve replicas plateau.
Copilot+ PC reference SoC. Phi-3.5 Mini via ONNX Runtime DirectML NPU provider. Microsoft's published benchmarks for Phi Silica suggest this range; cross-vendor independent verification needed.
Why this matters: Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the operator decision for Windows on-device deployments.
Intel NPU 4 with OpenVINO providing the inference path. Intel's published numbers + Microsoft's Phi Silica benchmarks suggest this range. Independent verification needed.
Why this matters: Lunar Lake is the Intel reference for Copilot+ PCs. Comparison vs Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
iPad M4 has the highest mobile-class memory bandwidth (120 GB/s) and a 38-TOPS NPU. Cold-start tok/s is well-published; the operator-relevant question is the sustained-load thermal-throttle curve over 30+ min.
Why this matters: Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.
Snapdragon Adreno GPU path via MLC LLM (vs Hexagon NPU via Qualcomm AI Hub). Comparing the same SoC's GPU vs NPU path is the operator decision: NPU wins on power, GPU on flexibility.
Why this matters: MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
MoE on a mixed-GPU rig via llama.cpp layer-split. Mixtral 8x22B (39B-active, 141B total) at Q4_K_M is ~80GB — does NOT fit on dual 24GB cards even with layer-split. Q3_K_M might fit. Marked pending to verify quant fit before measurement.
Why this matters: Research-tier benchmark — mixed-GPU is editorial-discouraged for new builds. Useful for users who already own asymmetric pairs.
Frequently asked
Does adding a second GPU double my VRAM?
No. Two 24 GB cards do not give you 48 GB to load a single model into. Each card holds its share of the model via tensor parallelism (vLLM/SGLang) or pipeline parallelism (llama.cpp layer split). Runtime overhead, KV cache, and activations consume per-card VRAM, so effective capacity is roughly the sum minus 2-3 GB per card. The exception is Apple unified memory and NVLink-Switch fabrics — those genuinely pool. PCIe-connected NVIDIA cards do not.
Is dual RTX 3090 better than a single RTX 5090 for local AI?
It depends on whether your target model fits in 32 GB. If it does (32B Q4, 14B FP16, etc.), single 5090 wins on power, simplicity, support life, and performance per dollar. If you need 70B-class (Q4 ~40 GB), dual 3090 with NVLink is the cheapest path to that envelope. Single 5090 with CPU offload technically runs 70B but throughput collapses by 5-10×.
Why doesn't the RTX 4090 have NVLink?
NVIDIA removed it. The 4090 has zero NVLink hardware. Two 4090s communicate only over PCIe, which means cross-card bandwidth is roughly 32 GB/s vs 112 GB/s on dual 3090 NVLink. For tensor parallelism this matters — expect 10-20% throughput penalty vs an NVLink-equipped pair. Many tutorials assume NVLink and produce misleading benchmarks for 4090 builds.
When should I choose Apple Silicon over multi-GPU NVIDIA?
When your priority is largest-model-that-fits at the lowest power and noise budget. Mac Studio M3 Ultra 192GB runs 200B-class models that don't fit on quad RTX 3090, at a quarter of the power draw, near-silent. The tradeoff: 25-30% the tokens-per-second of comparable NVIDIA, no CUDA ecosystem, smaller quantization toolchain. For research and largest-model envelope: Apple. For maximum throughput per dollar: NVIDIA multi-GPU.
What's the difference between tensor parallelism and pipeline parallelism?
Tensor parallelism splits each layer across GPUs — every card holds half the weights of every layer, and activations cross the bus on every forward pass. Latency-friendly with NVLink, painful over PCIe-only. Pipeline parallelism (llama.cpp 'layer split') puts whole layers on different cards — card 1 does layers 0-39, card 2 does layers 40-79. Cross-card traffic is once per token, not every layer. Better for asymmetric setups and PCIe-only rigs; worse for tight latency.
Can I mix a 4090 and a 3090 in the same system?
Technically yes; operationally inadvisable. llama.cpp layer-split accepts asymmetric ratios and works. vLLM tensor-parallel assumes symmetric cards and underperforms. Most importantly, you waste the 4090's compute waiting for the 3090 every layer. If you already own one of the cards, layer-split via llama.cpp is acceptable. If building from scratch, pair-match instead.
What's an Exo cluster and when does it make sense?
Exo distributes a single model across multiple Mac machines, sharding layers via Thunderbolt 4/5. A 4× Mac Mini M4 Pro Exo cluster gives 256 GB total memory. The use case: 200B+ class models that don't fit in a single Mac Studio M3 Ultra 192GB. The tradeoffs: Thunderbolt latency makes it 3-5× slower than the single Mac Studio, and operational complexity is significant. Only worth it when you genuinely need >192 GB unified memory.
Is multi-GPU worth it for hobby or solo-developer use?
Usually not, unless you specifically need 70B+ models. The operational complexity of multi-GPU (PCIe lane management, thermal management, runtime configuration, debugging) is real. For most hobby use cases, a single RTX 4090 or 5090 covers 32B-class workloads cleanly. Multi-GPU starts paying off when (a) you need a model that doesn't fit on one card, or (b) you're serving multiple concurrent users.
Going deeper
- Hardware combinations index — every combo with operator-grade detail.
- Distributed inference systems — architectural depth.
- Choosing a GPU for local AI in 2026 — single-card buying guide companion.
- Execution stacks — deployment recipes with components, runtime, and benchmarks.
The landscape has shifted significantly over the past two years. Splitting a model across two RTX 3090s gives you 48 GB of total VRAM (roughly 44 GB effective), which unlocks 70B-parameter models at 4-bit quantization with room for a usable context window, a configuration that was confined to expensive professional cards just a generation ago. The trade-off is power draw, case real estate, and PCIe configuration complexity that a single card like the RTX 4090 sidesteps entirely. Understanding which path fits your physical build is half the decision before you even compare numbers.
The closest single-card comparison worth studying: RTX 4090 vs dual RTX 3090 comparison.