Multi-GPU: scale beyond a single card
For: Operators with two or more GPUs (matched or mixed) running larger models or higher concurrency. By the end: A multi-GPU inference service running a model that wouldn't fit on a single card, with topology choices you can defend.
Two GPUs is not "twice as much"; it's a fundamentally different deployment problem. You now have to think about parallelism strategy, interconnect bandwidth, framework support, and the ways a model can be split across cards. This path walks from the first decision (matched cards or mixed?) to a working tensor-parallel deployment of a model that wouldn't fit on a single GPU.
Inventory your GPUs honestly
Mixed GPUs (a 4090 and a 3090) and matched GPUs (two 4090s) are different stories. Mixed cards limit you to pipeline parallel and to runtimes that allow per-card layer placement (llama.cpp). Matched cards open the door to tensor parallel on vLLM / SGLang. Knowing which case you're in is milestone one.
PCIe lane count matters. Two cards at x8/x8 is fine; one card at x16 and the other at x4 will bottleneck communication during tensor parallel. Check your motherboard's lane layout, not just the slot count: a slot that is physically x16 is often wired for only x4 lanes.
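A quick inventory pass, assuming an NVIDIA stack; the nvidia-smi query fields below exist on recent drivers, but older ones may not expose all of them:

```bash
# Per-card identity, VRAM, and PCIe link. width.current can read lower
# than width.max while a card idles, so re-check it under load.
nvidia-smi --query-gpu=index,name,memory.total,pcie.link.gen.max,pcie.link.width.max,pcie.link.width.current \
  --format=csv

# Topology matrix: shows whether the cards connect via NVLink (NV#),
# a shared PCIe switch (PIX/PXB), or must cross the CPU (PHB/NODE/SYS).
nvidia-smi topo -m
```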
Pick the parallelism strategy
Tensor parallel splits each layer across cards — every forward pass crosses the bus, so you want fast interconnect (NVLink > PCIe Gen5 x16 > PCIe Gen4 x16). Pipeline parallel pins layers to specific cards — bus traffic is limited to layer boundaries, but throughput depends on the slowest card. The decision dictates everything that follows.
Default for two matched 24GB cards: tensor parallel size 2. Default for a 4090 + 3090: pipeline parallel via llama.cpp with --tensor-split tuned to the actual VRAM ratio.
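Launch sketches for both defaults. The model IDs, ports, and the 55,45 split are illustrative assumptions, not prescriptions; recent vLLM releases expose the vllm serve entrypoint, and llama.cpp ships llama-server:

```bash
# Matched pair (2x 24GB): tensor parallel via vLLM.
# Model ID is an example that fits pooled 48GB; substitute your own.
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 16384

# Mixed pair (4090 + 3090): layer split via llama.cpp.
# --split-mode layer pins whole layers per card; --tensor-split sets the
# proportion each card takes. 55,45 slightly favors the faster 4090 and
# is a starting point to tune, not a rule.
./llama-server -m ./models/model.gguf \
  --n-gpu-layers 999 --split-mode layer --tensor-split 55,45 \
  --host 0.0.0.0 --port 8080
```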
Pick the runtime
vLLM is the reference for tensor-parallel serving on matched NVIDIA cards. SGLang's RadixAttention prefix cache makes it 1.3-1.7x faster on agent workloads with stable system prompts. llama.cpp handles mixed cards via --tensor-split and is the only path for asymmetric pairs. ROCm operators read the AMD path (linked below).
One choice. Run with it for a week before evaluating an alternative. Switching runtimes mid-rollout doubles the debugging surface.
Bring up tensor parallel and verify both cards are fed
The trap: vLLM happily starts with TP=1 even when you have two cards, and you'll see one card pegged and one idle. Set --tensor-parallel-size 2 explicitly. During a real workload, nvidia-smi should show both cards above 70% utilization; anything less means the workload isn't large enough or the interconnect is bottlenecking you.
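One way to verify, assuming an OpenAI-compatible endpoint on port 8000 and the example model from the sketch above:

```bash
# Fire eight concurrent requests so there is actually work to observe.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "Qwen/Qwen2.5-32B-Instruct",
         "prompt": "Explain tensor parallelism in one paragraph.",
         "max_tokens": 512}' > /dev/null &
done

# Meanwhile, watch both cards: utilization should sit above 70% on each.
watch -n 1 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader'
```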
Tune for your interconnect
NCCL_P2P_DISABLE=1 versus leaving it unset can be a 30% throughput difference on consumer cards (the GeForce driver disables PCIe P2P on the RTX 30/40/50 series). NCCL_DEBUG=INFO during startup tells you which transport is in use. Bench with and without P2P; the answer is rarely the same on consumer and datacenter cards.
For NVLink-paired cards (3090, A100, H100, RTX A6000), P2P and NVLink should both be active, and nccl-tests should report bus bandwidth far beyond what PCIe delivers: roughly 50 GB/s on a bridged 3090 pair, hundreds of GB/s on A100/H100.
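A minimal tuning loop. The benchmark command is a placeholder for whatever load test you use; all_reduce_perf comes from NVIDIA's nccl-tests repo (https://github.com/NVIDIA/nccl-tests) and must be built first:

```bash
# Which transport did NCCL pick? Look for "via P2P" / "via SHM" lines.
NCCL_DEBUG=INFO vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 2 2>&1 | grep 'NCCL INFO'

# A/B test: run your benchmark once with P2P disabled...
NCCL_P2P_DISABLE=1 <your-benchmark-command>
# ...and once with the variable unset, then compare tokens/s.

# Raw all-reduce bandwidth across both cards, 8 bytes to 256MB.
# On NVLink pairs the reported busbw should dwarf the PCIe numbers.
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
```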
Pick a model that justifies the second card
Two 24GB cards give you roughly 48GB of pooled VRAM (minus per-card overhead). That gets you a 70B model at Q4-Q5, higher-throughput serving of a 32B with a bigger KV cache, or longer context on a 14B. Don't keep running a 7B on dual GPUs; that's multi-GPU overhead doing single-GPU work.
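A back-of-envelope sizing check, assuming ~4.5 effective bits per weight for a Q4_K-class quant (the exact figure varies by quantization scheme and model):

```bash
# 70e9 weights * 4.5 bits / 8 bits-per-byte ~= 39 GB for weights alone.
# KV cache, activations, and runtime overhead come on top, so budget
# roughly 42-46 GB total: fits 2x24GB with margin, but not one card.
awk 'BEGIN { printf "weights alone: ~%.0f GB\n", 70e9 * 4.5 / 8 / 1e9 }'
```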
Profile sustained throughput honestly
"It feels fast" is not a benchmark. Use the runtime's own benchmark tools or a load-tester to measure: TTFT under load, sustained output tokens per second at concurrency 1, 4, 8, 16. Compare against the same model on a single GPU (or the next-larger model). This is the data that justifies the second card.
If you can't justify it, you have one of two problems: wrong model size, or wrong parallelism strategy. Both are fixable.
Restart-proof the deployment
Multi-GPU deployments have more failure modes than single-GPU: one card can hang, one can OOM, NCCL can deadlock. A robust deployment unit detects these, restarts cleanly, and emits a metric. If your service "works fine when you launch it manually," it isn't a service yet.
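A minimal systemd sketch with placeholder paths, user, and model. Restart= catches crashes and OOM exits; a hung NCCL collective does not exit, so pair this with an external health probe that restarts the unit when generation stalls:

```bash
sudo tee /etc/systemd/system/llm.service >/dev/null <<'EOF'
[Unit]
Description=Multi-GPU inference server (tensor parallel)
After=network-online.target

[Service]
User=llm
# Placeholder venv path and example model; substitute your own.
ExecStart=/opt/llm/venv/bin/vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 2 --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llm.service
```

journalctl -u llm.service gives you the restart history; exporting restart counts to your metrics stack is the "emits a metric" half.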
Next recommended step
The full topology + runtime cross-cut: when to pick which combination, with operator-grade tradeoffs.