What runs on a vLLM tensor-parallel 4× H100 80GB workstation?
Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
DGX-class deployment recipe with vLLM TP-4, FP8 transformer engine, NVLink-Switch verification, and cost-realism vs cloud rental.
4× H100 80GB SXM with NVLink-Switch fabric is the rare configuration where total VRAM ≈ effective VRAM. The NVLink-Switch (DGX-H100 chassis) provides full-mesh 900 GB/s bidirectional bandwidth between all 4 cards, allowing tensor parallelism with negligible cross-card overhead. Effective ceiling for inference is ~300 GB — total minus ~5 GB per card for activations, KV cache, and runtime overhead at 32K context. This is the configuration where Qwen 3.5 235B-A17B at FP8 fits with full headroom, or DeepSeek V4 Pro at AWQ-INT4 fits comfortably.
NVLink-Switch fabric (900 GB/s mesh) makes tensor-parallel cross-card overhead near-zero. Effective 300 GB of total 320 GB after activations + KV cache.
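The headroom arithmetic above can be sketched in a few lines. This is a minimal, hedged check using the page's own numbers (~5 GB/card reserved for activations, KV cache, and runtime at 32K context); the exact overhead varies with runtime and context length.

```python
# Effective-VRAM headroom check for 4x H100 80GB, using the page's estimates.
NUM_GPUS = 4
VRAM_PER_GPU_GB = 80
OVERHEAD_PER_GPU_GB = 5   # page's estimate: activations + KV cache + runtime at 32K

effective_gb = NUM_GPUS * (VRAM_PER_GPU_GB - OVERHEAD_PER_GPU_GB)

def fits(weight_gb: float) -> bool:
    """True if the model's weight footprint fits under the effective ceiling."""
    return weight_gb <= effective_gb

# FP8 weights are ~1 byte/param, so a 235B-param model is ~235 GB of weights.
print(effective_gb)   # 300
print(fits(235))      # True: ~235 GB of FP8 weights sits under the ~300 GB ceiling
```

The same check with a 320 GB footprint returns False, which is why full-precision frontier dense models land in the "not practical" bucket below.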
See the multi-GPU guide for the full math + tradeoffs.
Topology
Models that fit comfortably (24)
Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.
Borderline (5)
Fits but with little headroom. KV cache for long context may not fit; verify before deployment.
Effective VRAM utilization >93% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.
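"Verify KV cache budget" is straightforward arithmetic: for a standard transformer, the cache holds K and V for every layer and KV head at every token position. A sketch, using a hypothetical GQA model config for illustration (the layer/head numbers below are assumptions, not any specific model's):

```python
# Per-token KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=1):
    # bytes_per_elem=1 assumes an FP8 KV cache; use 2 for FP16/BF16
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def kv_gb(context_len, **cfg):
    """KV cache size in GB for one sequence at the given context length."""
    return kv_bytes_per_token(**cfg) * context_len / 1e9

# Hypothetical GQA model: 64 layers, 8 KV heads, head_dim 128, FP8 cache.
cfg = dict(num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_elem=1)
print(round(kv_gb(32_768, **cfg), 2))   # ~4.3 GB for a single 32K-token sequence
```

Multiply by your target concurrency: a handful of 32K streams on a borderline combo can consume the entire remaining headroom, which is exactly the failure mode the note above warns about.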
Not practical (7)
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.
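The three buckets above follow explicit utilization thresholds (≤85% comfortable, >93% no KV room for long context, >100% of effective VRAM not runnable). A small classifier capturing that taxonomy; the label strings and the treatment of the 85-93% gap as "borderline" are this sketch's choices, not the catalog's exact wording:

```python
def classify_fit(weight_gb: float, effective_gb: float = 300.0) -> str:
    """Map a weight footprint to the page's fit categories for this combo.

    Thresholds from the page: <=85% utilization is comfortable, >93% leaves
    no KV cache room for long context, and weights over 100% of effective
    VRAM will not run cleanly at all.
    """
    util = weight_gb / effective_gb
    if util > 1.0:
        return "not practical"            # weights exceed effective combo VRAM
    if util > 0.93:
        return "cap context at ~4-8K"     # fits, but long-context KV won't
    if util <= 0.85:
        return "comfortable"
    return "borderline"

print(classify_fit(235))   # comfortable (~78% of 300 GB)
print(classify_fit(290))   # cap context at ~4-8K (~97%)
print(classify_fit(340))   # not practical
```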
Benchmark opportunities
Estimates, not measurements. Pending benchmark targets for this combo; once measured, results land in the catalog as benchmarks.
Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and dramatic concurrency lift via SGLang RadixAttention.
DeepSeek V4 Flash is the throughput-tuned V4 sibling. At 80B total / 12B active parameters on 4× H100, it should produce the strongest open-weight tok/s of 2026.
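The tok/s expectation is a memory-bandwidth roofline: each decoded token requires streaming the active weights from HBM once, split across the TP group. A rough upper-bound sketch, assuming ~3.35 TB/s HBM3 per H100 SXM and FP8 (1 byte/param) active weights; real decode lands below this bound due to KV cache reads, communication, and scheduling:

```python
# Roofline decode estimate for a MoE: tok/s <= aggregate HBM bandwidth / active bytes.
# Assumptions (not measured): 3.35 TB/s per H100 SXM, FP8 = 1 byte/param,
# 12B active params streamed once per token, ideal TP-4 sharding.
HBM_TBPS = 3.35
NUM_GPUS = 4
ACTIVE_PARAMS_B = 12     # billions of active params (per the page)
BYTES_PER_PARAM = 1      # FP8

aggregate_tbps = HBM_TBPS * NUM_GPUS
tokens_per_s = aggregate_tbps * 1e12 / (ACTIVE_PARAMS_B * 1e9 * BYTES_PER_PARAM)
print(int(tokens_per_s))   # ~1116 tok/s upper bound for single-stream decode
```

Batched serving trades some of this per-stream bound for throughput, which is where the concurrency lift mentioned above comes from.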
Going deeper
- Full combo detail page — operational review with failure modes and runtime matrix.
- Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
- Will-it-run home — single-card check + custom builds.