vLLM tensor-parallel 4× H100 80GB workstation
Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
Tweak GPU count, mix in another card, switch OS / runtime — see which models still fit.
4× H100 80GB SXM with NVLink-Switch fabric is the rare configuration where total VRAM ≈ effective VRAM. The NVLink-Switch (DGX-H100 chassis) provides full-mesh 900 GB/s bidirectional bandwidth between all 4 cards, allowing tensor parallelism with negligible cross-card overhead. Effective ceiling for inference is ~300 GB — total minus ~5 GB per card for activations, KV cache, and runtime overhead at 32K context. This is the configuration where Qwen 3.5 235B-A17B at FP8 fits with full headroom, or DeepSeek V4 Pro at AWQ-INT4 fits comfortably.
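A back-of-the-envelope fit check mirroring the budget above (320 GB total, ~5 GB per card reserved for activations, KV cache, and runtime overhead at 32K context); a minimal sketch, with the bytes-per-parameter figures as assumptions:

```python
# Rough VRAM fit check for 4x H100 80GB, mirroring the budget described above.
# Assumption: ~5 GB/card reserved for activations, KV cache, and runtime at 32K context.
GPUS = 4
VRAM_PER_GPU_GB = 80
OVERHEAD_PER_GPU_GB = 5

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weights_fit(total_params_b: float, quant: str) -> bool:
    """True if the weights alone fit under the effective VRAM ceiling."""
    effective_gb = GPUS * (VRAM_PER_GPU_GB - OVERHEAD_PER_GPU_GB)   # ~300 GB
    weights_gb = total_params_b * BYTES_PER_PARAM[quant]            # 1B params at 1 byte/param ~= 1 GB
    return weights_gb <= effective_gb

print(weights_fit(235, "fp8"))   # ~235 GB of weights vs ~300 GB effective -> True
print(weights_fit(235, "fp16"))  # ~470 GB -> False
```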
Topology
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.
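As an illustration of the most common strategy on this rig (Megatron-style tensor parallelism, the scheme behind `--tensor-parallel-size 4`), the sketch below shows how one transformer layer's weight matrices shard across 4 ranks. The layer dimensions are hypothetical, not tied to any particular model:

```python
# Sketch: Megatron-style tensor-parallel sharding of one transformer layer across 4 ranks.
# Dimensions are illustrative only.
TP = 4
hidden = 8192
num_heads = 64
ffn = 4 * hidden

# Column-parallel: QKV and FFN-up weights split along the output dimension,
# so each rank holds num_heads // TP attention heads.
qkv_shard = (hidden, 3 * hidden // TP)
ffn_up_shard = (hidden, ffn // TP)

# Row-parallel: attention-output and FFN-down weights split along the input dimension;
# partial results are summed with an all-reduce over the NVLink fabric.
attn_out_shard = (hidden // TP, hidden)
ffn_down_shard = (ffn // TP, hidden)

print("heads per rank:", num_heads // TP)
print(qkv_shard, ffn_up_shard, attn_out_shard, ffn_down_shard)
```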
Why this combo
4× H100 80GB SXM is the datacenter production reference for local-AI serving. The use cases:
- Frontier MoE production serving (Qwen 3.5, DeepSeek V4, Llama 4)
- High-concurrency 70-100B inference for organizations
- Research / training workloads
- Multi-tenant agent serving at scale
Honest framing: this is enterprise-tier hardware. For individuals, hosted inference (Together, Fireworks, Anthropic API) is dramatically cheaper. The case for self-hosting at this tier is data sovereignty, custom models, or extreme inference volume.
Runtime compatibility
- vLLM ✓ excellent. The reference deployment: `--tensor-parallel-size 4` with FP8 or AWQ-INT4 quants (a launch sketch follows this list).
- SGLang ✓ excellent. Particularly strong for agent serving with stable system prompts.
- TensorRT-LLM ✓ best-in-class throughput at the cost of recompile-per-config friction.
- Ray Serve ✓ for multi-replica patterns at scale.
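A minimal launch sketch of the TP-4 setup through vLLM's offline Python API; the checkpoint name is a placeholder, and the knobs mirror the CLI flags above:

```python
# Minimal vLLM tensor-parallel-4 sketch (offline Python API).
# The model ID is a placeholder; substitute the FP8 or AWQ-INT4 checkpoint you serve.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/frontier-moe-fp8",    # placeholder checkpoint name
    tensor_parallel_size=4,          # shard across all four H100s over NVLink
    quantization="fp8",              # or "awq" for AWQ-INT4 checkpoints
    max_model_len=32768,             # matches the 32K-context budget above
    gpu_memory_utilization=0.90,     # leave headroom for activations and runtime
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Summarize NVLink-Switch topology in one paragraph."], params)
print(outputs[0].outputs[0].text)
```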
Comparison vs alternatives
| Metric | 4× H100 SXM workstation | 4× RTX 4090 | DGX H200 8-GPU |
|---|---|---|---|
| Effective VRAM | 300 GB | ~92 GB | 1100 GB+ |
| FP8 throughput | Top-tier | Limited | Top-tier |
| Tokens/sec (Qwen 3.5 235B INT4) | 80-150 | N/A (doesn't fit) | 200-400 |
| Cost | $200,000+ | $5,000-7,500 | $400,000+ |
| Production readiness | Yes | No | Yes |
This is the floor for serious frontier-MoE production serving. Below this tier (4× 4090, quad 3090), the model envelope doesn't reach frontier-tier targets at any practical quant.
Cloud alternative
For most teams, hosted H100 (RunPod, Lambda, CoreWeave) is the right path until inference volume exceeds ~$10k/month sustained.
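A rough buy-vs-rent break-even sketch behind that threshold; every figure is an illustrative assumption (36-month amortization, a flat monthly colo/power estimate), not a quote:

```python
# Rough buy-vs-rent break-even sketch. All figures are illustrative assumptions.
CAPEX_USD = 200_000           # chassis + 4x H100 SXM (comparison table above)
AMORTIZATION_MONTHS = 36      # assumed depreciation window
OPEX_PER_MONTH_USD = 3_000    # assumed colo, power, and cooling

owned_monthly = CAPEX_USD / AMORTIZATION_MONTHS + OPEX_PER_MONTH_USD
hosted_threshold = 10_000     # the sustained hosted-spend threshold mentioned above

print(f"owned: ~${owned_monthly:,.0f}/mo vs hosted threshold: ${hosted_threshold:,}/mo")
# Below ~$10k/month of sustained hosted spend, renting stays cheaper than owning.
```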
Related
- /stacks/distributed-inference-homelab — multi-node alternative
- /systems/distributed-inference — architectural depth
- /tools/vllm — runtime operational review
- /guides/running-local-ai-on-multiple-gpus-2026 — buying guide
Best model classes
- Frontier MoE serving — Qwen 3.5 235B-A17B, DeepSeek V4 Pro, Llama 4 Maverick all fit at FP8 or INT4 with full production headroom.
- High-concurrency 70-100B serving — vLLM serves 32+ concurrent agent loops at >50 tok/s each (see the load-test sketch below).
- Long-context 1M-token workloads — KV cache budget is generous at this VRAM tier.
This is the production-default deployment for organizations serving inference at scale.
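One way to sanity-check the concurrency figure above is a small load test against the vLLM OpenAI-compatible endpoint; the endpoint URL, served model name, and prompt here are assumptions:

```python
# Sketch: aggregate decode throughput for N concurrent "agent loops" against a
# vLLM OpenAI-compatible server. Endpoint, model name, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
CONCURRENCY = 32
MAX_TOKENS = 512

async def one_loop() -> int:
    resp = await client.chat.completions.create(
        model="served-model-name",   # placeholder
        messages=[{"role": "user", "content": "Plan a 5-step research task."}],
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_loop() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.0f} tok/s aggregate, "
          f"~{total / elapsed / CONCURRENCY:.0f} tok/s per stream")

asyncio.run(main())
```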
What this combo is bad at
- Cost-constrained deployment — $200,000+ all-in for the chassis + 4× H100 SXM. Only justified at significant production scale.
- Single-stream latency — tensor-parallel-4 doesn't beat tensor-parallel-2 for single-user latency; you only win on aggregate throughput.
- Edge deployment — datacenter rack required.
Who should avoid this
- Individual users / small teams — H100 cloud (RunPod, Lambda) is cheaper for sporadic workloads.
- Hobby projects — quad RTX 3090 covers 90% of hobby use cases at 5% of the cost.
- CUDA-version-sensitive workloads — H100 requires CUDA 12+ which may break older PyTorch / framework code.
DGX-class chassis. Datacenter rack required; not viable for office or home deployment. Liquid cooling is nominally optional but standard in practice for 24/7 deployment.
H100 SXM has the strongest reliability track record in production AI serving. NVIDIA enterprise warranty + datacenter SLAs. Failure modes are dominated by environmental factors (power quality, cooling) rather than card failure.
Ubuntu 22.04 LTS with NVIDIA enterprise driver stack.
Failure modes specific to 4× H100 SXM workstation
- Cooling under-spec. SXM modules require chassis-integrated liquid or aggressive air cooling; off-the-shelf chassis designed for PCIe H100 NVL cards are not built for SXM thermals. Verify the thermal envelope before committing.
- CUDA / driver / vLLM version mismatch. H100 features (FP8 transformer engine, MIG partitioning) require precise stack alignment. Pin versions in production (a startup guard sketch follows this list).
- NVLink-Switch firmware bugs. Rare but real — switch fabric issues produce subtle cross-card corruption that's hard to diagnose. Stay on NVIDIA-validated firmware.
- MIG partition complexity. Multi-Instance GPU mode is powerful but complex; misconfiguration produces silent throughput loss.
- Power delivery transients. 4× 700W = 2800W sustained; transients can hit 4000W. PDU and UPS sizing is non-trivial.
- Tensor-parallel-4 single-stream stall. Counter-intuitively, 4-rank tensor-parallel is slower per-stream than 2-rank because the all-reduce gets less efficient. For latency-critical single-user workloads, run 2× tensor-parallel-2 replicas instead.
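One way to catch the version-mismatch failure mode above is a fail-fast guard at service startup; a minimal sketch, with the pinned versions as placeholders rather than recommendations:

```python
# Startup guard sketch: fail fast if the CUDA / driver / framework stack drifts
# from what the deployment was validated on. Pinned values are placeholders.
import torch
import vllm

EXPECTED_CUDA = "12.4"                # placeholder pin
EXPECTED_VLLM = "0.6.3"               # placeholder pin
EXPECTED_COMPUTE_CAPABILITY = (9, 0)  # H100 = SM90

def check_stack() -> None:
    assert torch.cuda.is_available(), "CUDA not available"
    assert torch.version.cuda == EXPECTED_CUDA, (
        f"CUDA {torch.version.cuda} != pinned {EXPECTED_CUDA}")
    assert vllm.__version__ == EXPECTED_VLLM, (
        f"vLLM {vllm.__version__} != pinned {EXPECTED_VLLM}")
    for i in range(torch.cuda.device_count()):
        cc = torch.cuda.get_device_capability(i)
        assert cc == EXPECTED_COMPUTE_CAPABILITY, f"GPU {i} reports SM{cc}, expected SM90"

check_stack()
```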
Quad RTX 3090 →
If you can't justify the $200k+ datacenter spend, quad-3090 covers 100B-class at 5% of the cost. H100 wins on reliability + frontier-MoE; 3090 wins on price-to-capability ratio.
4× H100 SXM tensor-parallel workstation →
DGX-class deployment recipe with vLLM TP-4, FP8 transformer engine, NVLink-Switch verification, and cost-realism vs cloud rental.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)
qwen-3.5-235b-a17b
Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and a dramatic concurrency lift via SGLang RadixAttention.
4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)
deepseek-v4-flash
DeepSeek V4 Flash is the throughput-tuned V4 sibling. 80B/12B-active on 4× H100 should produce the strongest open-weight tok/s in 2026.
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.