Multi-GPU: scale beyond a single card
For: Operators with two or more GPUs (matched or mixed) running larger models or higher concurrency. By the end: A multi-GPU inference service running a model that wouldn't fit on a single card, with topology choices you can defend.
Two GPUs is not "twice as much"; it's a fundamentally different deployment problem. You now have to think about parallelism strategy, interconnect bandwidth, framework support, and the ways a model can be split across cards. This path walks from the first decision (matched cards or mixed?) to a working tensor-parallel deployment of a model that wouldn't fit on a single GPU.
Inventory your GPUs honestly
Mixed GPUs (a 4090 and a 3090) and matched GPUs (two 4090s) are different stories. Mixed cards limit you to pipeline parallel and to runtimes that allow per-card layer placement (llama.cpp). Matched cards open the door to tensor parallel on vLLM / SGLang. Knowing which case you're in is milestone one.
PCIe lane count matters. Two cards at x8/x8 is fine; one card at x16 and the other at x4 will bottleneck communication during tensor parallel. Check your motherboard's lane layout, not just the slot count: a slot that is physically x16 is often wired for only x4 lanes.
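A quick inventory pass, assuming an NVIDIA stack; the nvidia-smi query fields below exist on recent drivers, but older ones may not expose all of them:

```bash
# Per-card identity, VRAM, and PCIe link. width.current can read lower
# than width.max while a card idles, so re-check it under load.
nvidia-smi --query-gpu=index,name,memory.total,pcie.link.gen.max,pcie.link.width.max,pcie.link.width.current \
  --format=csv

# Topology matrix: shows whether the cards connect via NVLink (NV#),
# a shared PCIe switch (PIX/PXB), or must cross the CPU (PHB/NODE/SYS).
nvidia-smi topo -m
```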
Pick the parallelism strategy
Tensor parallel splits each layer across cards — every forward pass crosses the bus, so you want fast interconnect (NVLink > PCIe Gen5 x16 > PCIe Gen4 x16). Pipeline parallel pins layers to specific cards — bus traffic is limited to layer boundaries, but throughput depends on the slowest card. The decision dictates everything that follows.
Default for two matched 24GB cards: tensor parallel size 2. Default for a 4090 + 3090: pipeline parallel via llama.cpp with --tensor-split tuned to the actual VRAM ratio.
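Launch sketches for both defaults. The model IDs, ports, and the 55,45 split are illustrative assumptions, not prescriptions; recent vLLM releases expose the vllm serve entrypoint, and llama.cpp ships llama-server:

```bash
# Matched pair (2x 24GB): tensor parallel via vLLM.
# Model ID is an example that fits pooled 48GB; substitute your own.
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 16384

# Mixed pair (4090 + 3090): layer split via llama.cpp.
# --split-mode layer pins whole layers per card; --tensor-split sets the
# proportion each card takes. 55,45 slightly favors the faster 4090 and
# is a starting point to tune, not a rule.
./llama-server -m ./models/model.gguf \
  --n-gpu-layers 999 --split-mode layer --tensor-split 55,45 \
  --host 0.0.0.0 --port 8080
```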
Pick the runtime
vLLM is the reference for tensor-parallel serving on matched NVIDIA cards. SGLang's RadixAttention prefix cache makes it 1.3-1.7x faster on agent workloads with stable system prompts. llama.cpp handles mixed cards via --tensor-split and is the only path for asymmetric pairs. ROCm operators read the AMD path (linked below).
One choice. Run with it for a week before evaluating an alternative. Switching runtimes mid-rollout doubles the debugging surface.
Bring up tensor parallel and verify both cards are fed
The trap: vLLM happily starts with TP=1 even when you have two cards, and you'll see one card pegged and one idle. Set --tensor-parallel-size 2 explicitly. During a real workload, nvidia-smi should show both cards above 70% utilization; anything less means the workload isn't large enough or the interconnect is bottlenecking you.
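One way to verify, assuming an OpenAI-compatible endpoint on port 8000 and the example model from the sketch above:

```bash
# Fire eight concurrent requests so there is actually work to observe.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "Qwen/Qwen2.5-32B-Instruct",
         "prompt": "Explain tensor parallelism in one paragraph.",
         "max_tokens": 512}' > /dev/null &
done

# Meanwhile, watch both cards: utilization should sit above 70% on each.
watch -n 1 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader'
```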
Tune for your interconnect
NCCL_P2P_DISABLE=1 versus leaving it unset can be a 30% throughput difference on consumer cards (the GeForce driver disables PCIe P2P on the RTX 30/40/50 series). NCCL_DEBUG=INFO during startup tells you which transport is in use. Bench with and without P2P; the answer is rarely the same on consumer and datacenter cards.
For NVLink-paired cards (3090, A100, H100, RTX A6000), P2P and NVLink should both be active, and nccl-tests should report bus bandwidth far beyond what PCIe delivers: roughly 50 GB/s on a bridged 3090 pair, hundreds of GB/s on A100/H100.
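A minimal tuning loop. The benchmark command is a placeholder for whatever load test you use; all_reduce_perf comes from NVIDIA's nccl-tests repo (https://github.com/NVIDIA/nccl-tests) and must be built first:

```bash
# Which transport did NCCL pick? Look for "via P2P" / "via SHM" lines.
NCCL_DEBUG=INFO vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 2 2>&1 | grep 'NCCL INFO'

# A/B test: run your benchmark once with P2P disabled...
NCCL_P2P_DISABLE=1 <your-benchmark-command>
# ...and once with the variable unset, then compare tokens/s.

# Raw all-reduce bandwidth across both cards, 8 bytes to 256MB.
# On NVLink pairs the reported busbw should dwarf the PCIe numbers.
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
```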
Pick a model that justifies the second card
Two 24GB cards give you roughly 48GB of pooled VRAM (minus per-card overhead). That gets you a 70B model at Q4-Q5, higher-throughput serving of a 32B with a bigger KV cache, or longer context on a 14B. Don't keep running a 7B on dual GPUs; that's multi-GPU overhead doing single-GPU work.
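A back-of-envelope sizing check, assuming ~4.5 effective bits per weight for a Q4_K-class quant (the exact figure varies by quantization scheme and model):

```bash
# 70e9 weights * 4.5 bits / 8 bits-per-byte ~= 39 GB for weights alone.
# KV cache, activations, and runtime overhead come on top, so budget
# roughly 42-46 GB total: fits 2x24GB with margin, but not one card.
awk 'BEGIN { printf "weights alone: ~%.0f GB\n", 70e9 * 4.5 / 8 / 1e9 }'
```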
Profile sustained throughput honestly
"It feels fast" is not a benchmark. Use the runtime's own benchmark tools or a load-tester to measure: TTFT under load, sustained output tokens per second at concurrency 1, 4, 8, 16. Compare against the same model on a single GPU (or the next-larger model). This is the data that justifies the second card.
If you can't justify it, you have one of two problems: wrong model size, or wrong parallelism strategy. Both are fixable.
Restart-proof the deployment
Multi-GPU deployments have more failure modes than single-GPU: one card can hang, one can OOM, NCCL can deadlock. A robust deployment unit detects these, restarts cleanly, and emits a metric. If your service "works fine when you launch it manually," it isn't a service yet.
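A minimal systemd sketch with placeholder paths, user, and model. Restart= catches crashes and OOM exits; a hung NCCL collective does not exit, so pair this with an external health probe that restarts the unit when generation stalls:

```bash
sudo tee /etc/systemd/system/llm.service >/dev/null <<'EOF'
[Unit]
Description=Multi-GPU inference server (tensor parallel)
After=network-online.target

[Service]
User=llm
# Placeholder venv path and example model; substitute your own.
ExecStart=/opt/llm/venv/bin/vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 2 --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llm.service
```

journalctl -u llm.service gives you the restart history; exporting restart counts to your metrics stack is the "emits a metric" half.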
Next recommended step
The full topology + runtime cross-cut: when to pick which combination, with operator-grade tradeoffs.