Reviewed May 2026

Tensor parallelism crash — fix multi-GPU NCCL + topology issues

Multi-GPU tensor-parallel crashes trace to NCCL backend issues (PCIe topology, missing peer access), insufficient GPU pair memory, or tensor-parallel-size not matching GPU count. Diagnose with NCCL_DEBUG=INFO.

Tags: vLLM · ExLlamaV2 · TensorRT-LLM · DeepSpeed · PyTorch DDP · NCCL
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

TP-size doesn't match available GPU count

Diagnose

vLLM crashes with 'tensor-parallel-size 4 but only 2 GPUs visible.' `nvidia-smi` shows the count.

Fix

Set `--tensor-parallel-size` to match `nvidia-smi` count. For dual-GPU: `--tensor-parallel-size 2`. Verify with `CUDA_VISIBLE_DEVICES=0,1` to explicitly pin.
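The pre-launch check can be sketched in a few lines. This is an illustrative helper, not part of vLLM; the function names are made up for this example.

```python
def visible_gpu_count(env: dict) -> int:
    """Count the GPUs a CUDA process will see. CUDA_VISIBLE_DEVICES is a
    comma-separated list of device IDs; when it is unset, every physical
    GPU is visible (query nvidia-smi for that count instead)."""
    devs = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    return len([d for d in devs.split(",") if d]) if devs else 0

def check_tp_size(tp_size: int, visible: int) -> bool:
    """--tensor-parallel-size must not exceed the visible GPU count."""
    return 0 < tp_size <= visible

# Two GPUs pinned, --tensor-parallel-size 4: the launch will crash.
print(check_tp_size(4, visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"})))  # False
```

Running this check before the server launch turns a mid-startup crash into an immediate, readable error.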

#2

PCIe topology blocks peer access (NCCL falls back to host memory)

Diagnose

Multi-GPU inference works but is slow. `nvidia-smi topo -m` shows `SYS` between GPUs (no direct peer access), so NCCL traffic is staged through host memory.

Fix

If the GPUs sit on different CPU sockets or behind different PCIe switches, NCCL can't use peer-to-peer access. Move both GPUs onto the same CPU's PCIe lanes. For a consumer dual-GPU build, ensure both cards run at PCIe 4.0 x8 or x16 from the same CPU.
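Reading the `nvidia-smi topo -m` matrix by hand gets tedious with more than two GPUs. A rough parser that flags `SYS` pairs might look like this; the sample matrix below is illustrative, since real output varies by driver version and device count.

```python
# Sample of the link-type matrix printed by `nvidia-smi topo -m`.
# SYS means the path crosses the inter-socket link, so NCCL cannot
# use direct peer access between that pair.
SAMPLE = """\
      GPU0  GPU1
GPU0   X    SYS
GPU1  SYS    X
"""

def slow_links(topo: str):
    """Return GPU pairs whose link type is SYS (host-memory fallback)."""
    rows = [line.split() for line in topo.strip().splitlines()]
    header = rows[0]
    pairs = []
    for row in rows[1:]:
        src, cells = row[0], row[1:]
        for dst, link in zip(header, cells):
            if src < dst and link == "SYS":  # upper triangle only
                pairs.append((src, dst))
    return pairs

print(slow_links(SAMPLE))  # [('GPU0', 'GPU1')]
```

Any pair this prints is a candidate for re-slotting onto the same CPU's PCIe lanes.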

#3

Combined VRAM insufficient for the model

Diagnose

TP works on smaller models but fails on 70B+. Each GPU needs to hold its slice + KV cache + activations. Two 16 GB cards can't run 70B Q4.

Fix

Use a smaller model or a smaller quant, or upgrade to 24+ GB GPUs. For 70B Q4, dual 3090s (48 GB combined) are the practical minimum. FP16 70B needs ~140 GB for the weights alone, so plan on at least 2× H100 80 GB.

#4

NCCL version mismatch with PyTorch / runtime

Diagnose

Crash with 'NCCL version mismatch' or 'function not found.' The NCCL bundled with PyTorch differs from the system-installed copy.

Fix

Reinstall PyTorch from the official wheels, which bundle a matching NCCL: `pip install --upgrade --force-reinstall torch --index-url https://download.pytorch.org/whl/cu124`. Also check that `LD_PRELOAD` or `LD_LIBRARY_PATH` isn't forcing a system `libnccl` over the bundled copy.
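A quick way to reason about whether two NCCL builds can coexist: 'function not found' usually means the loaded library is older than the one PyTorch was compiled against. The helper below is a heuristic for this tutorial, not NCCL's official compatibility rule.

```python
def nccl_compatible(bundled: tuple, loaded: tuple) -> bool:
    """Heuristic: the runtime-loaded NCCL must share the bundled copy's
    major version and be at least as new in the minor version, or
    symbols PyTorch expects may be missing."""
    return bundled[0] == loaded[0] and bundled[1] <= loaded[1]

print(nccl_compatible((2, 20, 5), (2, 21, 5)))  # True: newer minor is fine
print(nccl_compatible((2, 20, 5), (2, 18, 1)))  # False: loaded lib too old
```

Compare `torch.cuda.nccl.version()` against whatever `ldd` shows your process actually loading.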

#5

GPU 0 not used (CUDA_VISIBLE_DEVICES misconfigured)

Diagnose

vLLM uses GPU 0 by default. If only GPUs 1 and 2 are exported, CUDA renumbers them as 0 and 1; most code paths handle this, but some launch scripts still assume specific physical IDs and fail.

Fix

Pin the devices explicitly: `CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 2 ...`. The framework then sees exactly two GPUs, numbered 0 and 1.
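The renumbering rule is easy to make concrete. This mapping helper is for illustration only; CUDA does the equivalent internally.

```python
def logical_to_physical(cuda_visible_devices: str) -> dict:
    """CUDA renumbers whatever CUDA_VISIBLE_DEVICES exposes starting at
    0, in the order listed. Frameworks only ever see the logical IDs."""
    physical = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
    return dict(enumerate(physical))

# Exporting physical GPUs 1 and 2: the framework addresses them as 0 and 1.
print(logical_to_physical("1,2"))  # {0: 1, 1: 2}
```

This is also why `CUDA_VISIBLE_DEVICES=2,1` swaps which card the framework treats as "GPU 0".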

#6

P2P (peer-to-peer) disabled in BIOS

Diagnose

Multi-GPU works but throughput is degraded. `nvidia-smi topo -m` shows `PXB` (path crosses multiple PCIe bridges) instead of `PIX` (single bridge) between GPUs. A BIOS PCIe setting is preventing direct P2P.

Fix

Enable 'Above 4G Decoding' in the BIOS; on some boards enabling 'Re-Size BAR Support' also helps. Reboot and re-run `nvidia-smi topo -m` to confirm the link type improved.

Frequently asked questions

How much faster is tensor parallelism vs single-GPU?

vLLM / ExLlamaV2 typically scale 1.7-1.9x on dual-GPU for inference (a memory-bound workload). Training scales closer to 1.95x. The shortfall is communication overhead: tensor parallelism all-reduces activations across GPUs at every layer.
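A toy model makes the sub-linear scaling intuitive. The communication fraction below is an assumed parameter, not a measured value.

```python
def tp_speedup(n_gpus: int, comm_fraction: float) -> float:
    """Toy scaling model: compute splits evenly across GPUs, but each
    step adds all-reduce time equal to comm_fraction of the original
    single-GPU step. Ignores overlap of compute and communication."""
    return 1.0 / (1.0 / n_gpus + comm_fraction)

# An assumed ~8% per-step communication cost on dual GPU lands near
# the 1.7x figure quoted above.
print(round(tp_speedup(2, 0.08), 2))  # 1.72
```

The same formula shows why faster interconnects (smaller `comm_fraction`) matter more as GPU count grows: at 4 GPUs the ideal 4x shrinks to ~3x with the same 8% overhead.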

Do I need NVLink for tensor parallelism?

No — tensor parallelism works over PCIe 4.0 x16. NVLink helps but isn't required. Most consumer dual-GPU rigs (3090, 4090, 5090) use PCIe and work fine; NVIDIA dropped NVLink from consumer cards starting with the 4090 anyway.

Can I tensor-parallel different GPU models?

Technically yes, but the slow card bottlenecks the fast one. Mixing 4090 + 3090 means 3090 throughput on tensor-parallel workloads. Match cards for production setups.

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time.