What runs on a vLLM tensor-parallel 4× H100 80GB workstation?
Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
DGX-class deployment recipe with vLLM TP-4, FP8 transformer engine, NVLink-Switch verification, and cost-realism vs cloud rental.
4× H100 80GB SXM with NVLink-Switch fabric is the rare configuration where total VRAM ≈ effective VRAM. The NVLink-Switch (DGX-H100 chassis) provides full-mesh 900 GB/s bidirectional bandwidth between all 4 cards, allowing tensor parallelism with negligible cross-card overhead. Effective ceiling for inference is ~300 GB — total minus ~5 GB per card for activations, KV cache, and runtime overhead at 32K context. This is the configuration where Qwen 3.5 235B-A17B at FP8 fits with full headroom, or DeepSeek V4 Pro at AWQ-INT4 fits comfortably.
NVLink-Switch fabric (900 GB/s mesh) makes tensor-parallel cross-card overhead near-zero. Effective 300 GB of total 320 GB after activations + KV cache.
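The headroom arithmetic above can be sketched in a few lines. This is a minimal, hedged check using the page's own numbers (~5 GB/card reserved for activations, KV cache, and runtime at 32K context); the exact overhead varies with runtime and context length.

```python
# Effective-VRAM headroom check for 4x H100 80GB, using the page's estimates.
NUM_GPUS = 4
VRAM_PER_GPU_GB = 80
OVERHEAD_PER_GPU_GB = 5   # page's estimate: activations + KV cache + runtime at 32K

effective_gb = NUM_GPUS * (VRAM_PER_GPU_GB - OVERHEAD_PER_GPU_GB)

def fits(weight_gb: float) -> bool:
    """True if the model's weight footprint fits under the effective ceiling."""
    return weight_gb <= effective_gb

# FP8 weights are ~1 byte/param, so a 235B-param model is ~235 GB of weights.
print(effective_gb)   # 300
print(fits(235))      # True: ~235 GB of FP8 weights sits under the ~300 GB ceiling
```

The same check with a 320 GB footprint returns False, which is why full-precision frontier dense models land in the "not practical" bucket below.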
See the multi-GPU guide for the full math + tradeoffs.
Topology
Models that fit comfortably (24)
Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.
Borderline (5)
Fits but with little headroom. KV cache for long context may not fit; verify before deployment.
Effective VRAM utilization >93% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.
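"Verify KV cache budget" is straightforward arithmetic: for a standard transformer, the cache holds K and V for every layer and KV head at every token position. A sketch, using a hypothetical GQA model config for illustration (the layer/head numbers below are assumptions, not any specific model's):

```python
# Per-token KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=1):
    # bytes_per_elem=1 assumes an FP8 KV cache; use 2 for FP16/BF16
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def kv_gb(context_len, **cfg):
    """KV cache size in GB for one sequence at the given context length."""
    return kv_bytes_per_token(**cfg) * context_len / 1e9

# Hypothetical GQA model: 64 layers, 8 KV heads, head_dim 128, FP8 cache.
cfg = dict(num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_elem=1)
print(round(kv_gb(32_768, **cfg), 2))   # ~4.3 GB for a single 32K-token sequence
```

Multiply by your target concurrency: a handful of 32K streams on a borderline combo can consume the entire remaining headroom, which is exactly the failure mode the note above warns about.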
Not practical (7)
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.
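The three buckets above follow explicit utilization thresholds (≤85% comfortable, >93% no KV room for long context, >100% of effective VRAM not runnable). A small classifier capturing that taxonomy; the label strings and the treatment of the 85-93% gap as "borderline" are this sketch's choices, not the catalog's exact wording:

```python
def classify_fit(weight_gb: float, effective_gb: float = 300.0) -> str:
    """Map a weight footprint to the page's fit categories for this combo.

    Thresholds from the page: <=85% utilization is comfortable, >93% leaves
    no KV cache room for long context, and weights over 100% of effective
    VRAM will not run cleanly at all.
    """
    util = weight_gb / effective_gb
    if util > 1.0:
        return "not practical"            # weights exceed effective combo VRAM
    if util > 0.93:
        return "cap context at ~4-8K"     # fits, but long-context KV won't
    if util <= 0.85:
        return "comfortable"
    return "borderline"

print(classify_fit(235))   # comfortable (~78% of 300 GB)
print(classify_fit(290))   # cap context at ~4-8K (~97%)
print(classify_fit(340))   # not practical
```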
Benchmark opportunities
Estimates, not measurements. Pending benchmark targets for this combo; once measured, results land in the catalog as benchmarks.
Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and dramatic concurrency lift via SGLang RadixAttention.
DeepSeek V4 Flash is the throughput-tuned V4 sibling. At 80B total / 12B active parameters on 4× H100, it should produce the strongest open-weight tok/s of 2026.
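The tok/s expectation is a memory-bandwidth roofline: each decoded token requires streaming the active weights from HBM once, split across the TP group. A rough upper-bound sketch, assuming ~3.35 TB/s HBM3 per H100 SXM and FP8 (1 byte/param) active weights; real decode lands below this bound due to KV cache reads, communication, and scheduling:

```python
# Roofline decode estimate for a MoE: tok/s <= aggregate HBM bandwidth / active bytes.
# Assumptions (not measured): 3.35 TB/s per H100 SXM, FP8 = 1 byte/param,
# 12B active params streamed once per token, ideal TP-4 sharding.
HBM_TBPS = 3.35
NUM_GPUS = 4
ACTIVE_PARAMS_B = 12     # billions of active params (per the page)
BYTES_PER_PARAM = 1      # FP8

aggregate_tbps = HBM_TBPS * NUM_GPUS
tokens_per_s = aggregate_tbps * 1e12 / (ACTIVE_PARAMS_B * 1e9 * BYTES_PER_PARAM)
print(int(tokens_per_s))   # ~1116 tok/s upper bound for single-stream decode
```

Batched serving trades some of this per-stream bound for throughput, which is where the concurrency lift mentioned above comes from.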
Going deeper
- Full combo detail page — operational review with failure modes and runtime matrix.
- Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
- Will-it-run home — single-card check + custom builds.