vLLM tensor-parallel 4× H100 80GB workstation
Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
Tweak GPU count, mix in another card, switch OS / runtime — see which models still fit.
4× H100 80GB SXM with NVLink-Switch fabric is the rare configuration where total VRAM ≈ effective VRAM. The NVLink-Switch (DGX-H100 chassis) provides full-mesh 900 GB/s bidirectional bandwidth between all 4 cards, allowing tensor parallelism with negligible cross-card overhead. Effective ceiling for inference is ~300 GB — total minus ~5 GB per card for activations, KV cache, and runtime overhead at 32K context. This is the configuration where Qwen 3.5 235B-A17B at FP8 fits with full headroom, or DeepSeek V4 Pro at AWQ-INT4 fits comfortably.
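A back-of-the-envelope fit check mirroring the budget above (320 GB total, ~5 GB per card reserved for activations, KV cache, and runtime overhead at 32K context); a minimal sketch, with the bytes-per-parameter figures as assumptions:

```python
# Rough VRAM fit check for 4x H100 80GB, mirroring the budget described above.
# Assumption: ~5 GB/card reserved for activations, KV cache, and runtime at 32K context.
GPUS = 4
VRAM_PER_GPU_GB = 80
OVERHEAD_PER_GPU_GB = 5

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weights_fit(total_params_b: float, quant: str) -> bool:
    """True if the weights alone fit under the effective VRAM ceiling."""
    effective_gb = GPUS * (VRAM_PER_GPU_GB - OVERHEAD_PER_GPU_GB)   # ~300 GB
    weights_gb = total_params_b * BYTES_PER_PARAM[quant]            # 1B params at 1 byte/param ~= 1 GB
    return weights_gb <= effective_gb

print(weights_fit(235, "fp8"))   # ~235 GB of weights vs ~300 GB effective -> True
print(weights_fit(235, "fp16"))  # ~470 GB -> False
```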
Topology
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.
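As an illustration of the most common strategy on this rig (Megatron-style tensor parallelism, the scheme behind `--tensor-parallel-size 4`), the sketch below shows how one transformer layer's weight matrices shard across 4 ranks. The layer dimensions are hypothetical, not tied to any particular model:

```python
# Sketch: Megatron-style tensor-parallel sharding of one transformer layer across 4 ranks.
# Dimensions are illustrative only.
TP = 4
hidden = 8192
num_heads = 64
ffn = 4 * hidden

# Column-parallel: QKV and FFN-up weights split along the output dimension,
# so each rank holds num_heads // TP attention heads.
qkv_shard = (hidden, 3 * hidden // TP)
ffn_up_shard = (hidden, ffn // TP)

# Row-parallel: attention-output and FFN-down weights split along the input dimension;
# partial results are summed with an all-reduce over the NVLink fabric.
attn_out_shard = (hidden // TP, hidden)
ffn_down_shard = (ffn // TP, hidden)

print("heads per rank:", num_heads // TP)
print(qkv_shard, ffn_up_shard, attn_out_shard, ffn_down_shard)
```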
Why this combo
4× H100 80GB SXM is the datacenter production reference for local-AI serving. The use cases:
- Frontier MoE production serving (Qwen 3.5, DeepSeek V4, Llama 4)
- High-concurrency 70-100B inference for organizations
- Research / training workloads
- Multi-tenant agent serving at scale
Honest framing: this is enterprise-tier hardware. For individuals, hosted inference (Together, Fireworks, Anthropic API) is dramatically cheaper. The case for self-hosting at this tier is data sovereignty, custom models, or extreme inference volume.
Runtime compatibility
- vLLM ✓ excellent. The reference deployment: `--tensor-parallel-size 4` with FP8 or AWQ-INT4 quants (a launch sketch follows this list).
- SGLang ✓ excellent. Particularly strong for agent serving with stable system prompts.
- TensorRT-LLM ✓ best-in-class throughput at the cost of recompile-per-config friction.
- Ray Serve ✓ for multi-replica patterns at scale.
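A minimal launch sketch of the TP-4 setup through vLLM's offline Python API; the checkpoint name is a placeholder, and the knobs mirror the CLI flags above:

```python
# Minimal vLLM tensor-parallel-4 sketch (offline Python API).
# The model ID is a placeholder; substitute the FP8 or AWQ-INT4 checkpoint you serve.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/frontier-moe-fp8",    # placeholder checkpoint name
    tensor_parallel_size=4,          # shard across all four H100s over NVLink
    quantization="fp8",              # or "awq" for AWQ-INT4 checkpoints
    max_model_len=32768,             # matches the 32K-context budget above
    gpu_memory_utilization=0.90,     # leave headroom for activations and runtime
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Summarize NVLink-Switch topology in one paragraph."], params)
print(outputs[0].outputs[0].text)
```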
Comparison vs alternatives
| Metric | 4× H100 SXM workstation | 4× RTX 4090 | DGX H200 8-GPU |
|---|---|---|---|
| Effective VRAM | 300 GB | ~92 GB | 1100 GB+ |
| FP8 throughput | Top-tier | Limited | Top-tier |
| Tokens/sec (Qwen 3.5 235B INT4) | 80-150 | N/A (doesn't fit) | 200-400 |
| Cost | $200,000+ | $5,000-7,500 | $400,000+ |
| Production readiness | Yes | No | Yes |
This is the floor for serious frontier-MoE production serving. Below this tier (4× 4090, quad 3090), the model envelope doesn't reach frontier-tier targets at any practical quant.
Cloud alternative
For most teams, hosted H100 (RunPod, Lambda, CoreWeave) is the right path until inference volume exceeds ~$10k/month sustained.
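A rough buy-vs-rent break-even sketch behind that threshold; every figure is an illustrative assumption (36-month amortization, a flat monthly colo/power estimate), not a quote:

```python
# Rough buy-vs-rent break-even sketch. All figures are illustrative assumptions.
CAPEX_USD = 200_000           # chassis + 4x H100 SXM (comparison table above)
AMORTIZATION_MONTHS = 36      # assumed depreciation window
OPEX_PER_MONTH_USD = 3_000    # assumed colo, power, and cooling

owned_monthly = CAPEX_USD / AMORTIZATION_MONTHS + OPEX_PER_MONTH_USD
hosted_threshold = 10_000     # the sustained hosted-spend threshold mentioned above

print(f"owned: ~${owned_monthly:,.0f}/mo vs hosted threshold: ${hosted_threshold:,}/mo")
# Below ~$10k/month of sustained hosted spend, renting stays cheaper than owning.
```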
Related
- /stacks/distributed-inference-homelab — multi-node alternative
- /systems/distributed-inference — architectural depth
- /tools/vllm — runtime operational review
- /guides/running-local-ai-on-multiple-gpus-2026 — buying guide
Best model classes
- Frontier MoE serving — Qwen 3.5 235B-A17B, DeepSeek V4 Pro, Llama 4 Maverick all fit at FP8 or INT4 with full production headroom.
- High-concurrency 70-100B serving — vLLM serves 32+ concurrent agent loops at >50 tok/s each (see the load-test sketch below).
- Long-context 1M-token workloads — KV cache budget is generous at this VRAM tier.
This is the production-default deployment for organizations serving inference at scale.
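One way to sanity-check the concurrency figure above is a small load test against the vLLM OpenAI-compatible endpoint; the endpoint URL, served model name, and prompt here are assumptions:

```python
# Sketch: aggregate decode throughput for N concurrent "agent loops" against a
# vLLM OpenAI-compatible server. Endpoint, model name, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
CONCURRENCY = 32
MAX_TOKENS = 512

async def one_loop() -> int:
    resp = await client.chat.completions.create(
        model="served-model-name",   # placeholder
        messages=[{"role": "user", "content": "Plan a 5-step research task."}],
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_loop() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.0f} tok/s aggregate, "
          f"~{total / elapsed / CONCURRENCY:.0f} tok/s per stream")

asyncio.run(main())
```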
What this combo is bad at
- Cost-constrained deployment — $200,000+ all-in for the chassis + 4× H100 SXM. Only justified at significant production scale.
- Single-stream latency — tensor-parallel-4 doesn't beat tensor-parallel-2 for single-user latency; you only win on aggregate throughput.
- Edge deployment — datacenter rack required.
Who should avoid this
- Individual users / small teams — H100 cloud (RunPod, Lambda) is cheaper for sporadic workloads.
- Hobby projects — quad RTX 3090 covers 90% of hobby use cases at 5% of the cost.
- CUDA-version-sensitive workloads — H100 requires CUDA 12+ which may break older PyTorch / framework code.
DGX-class chassis. Datacenter rack required; not viable for office or home deployment. Liquid cooling is nominally optional but standard in practice for 24/7 deployment.
H100 SXM has the strongest reliability track record in production AI serving. NVIDIA enterprise warranty + datacenter SLAs. Failure modes are dominated by environmental factors (power quality, cooling) rather than card failure.
Ubuntu 22.04 LTS with NVIDIA enterprise driver stack.
Failure modes specific to 4× H100 SXM workstation
- Cooling under-spec. SXM modules require chassis-integrated liquid or aggressive air cooling; off-the-shelf chassis designed for PCIe H100 NVL cards are not built for SXM thermals. Verify the thermal envelope before committing.
- CUDA / driver / vLLM version mismatch. H100 features (FP8 transformer engine, MIG partitioning) require precise stack alignment. Pin versions in production (a startup guard sketch follows this list).
- NVLink-Switch firmware bugs. Rare but real — switch fabric issues produce subtle cross-card corruption that's hard to diagnose. Stay on NVIDIA-validated firmware.
- MIG partition complexity. Multi-Instance GPU mode is powerful but complex; misconfiguration produces silent throughput loss.
- Power delivery transients. 4× 700W = 2800W sustained; transients can hit 4000W. PDU and UPS sizing is non-trivial.
- Tensor-parallel-4 single-stream stall. Counter-intuitively, 4-rank tensor-parallel is slower per-stream than 2-rank because the all-reduce gets less efficient. For latency-critical single-user workloads, run 2× tensor-parallel-2 replicas instead.
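One way to catch the version-mismatch failure mode above is a fail-fast guard at service startup; a minimal sketch, with the pinned versions as placeholders rather than recommendations:

```python
# Startup guard sketch: fail fast if the CUDA / driver / framework stack drifts
# from what the deployment was validated on. Pinned values are placeholders.
import torch
import vllm

EXPECTED_CUDA = "12.4"                # placeholder pin
EXPECTED_VLLM = "0.6.3"               # placeholder pin
EXPECTED_COMPUTE_CAPABILITY = (9, 0)  # H100 = SM90

def check_stack() -> None:
    assert torch.cuda.is_available(), "CUDA not available"
    assert torch.version.cuda == EXPECTED_CUDA, (
        f"CUDA {torch.version.cuda} != pinned {EXPECTED_CUDA}")
    assert vllm.__version__ == EXPECTED_VLLM, (
        f"vLLM {vllm.__version__} != pinned {EXPECTED_VLLM}")
    for i in range(torch.cuda.device_count()):
        cc = torch.cuda.get_device_capability(i)
        assert cc == EXPECTED_COMPUTE_CAPABILITY, f"GPU {i} reports SM{cc}, expected SM90"

check_stack()
```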
Quad RTX 3090 →
If you can't justify the $200k+ datacenter spend, quad-3090 covers 100B-class at 5% of the cost. H100 wins on reliability + frontier-MoE; 3090 wins on price-to-capability ratio.
4× H100 SXM tensor-parallel workstation →
DGX-class deployment recipe with vLLM TP-4, FP8 transformer engine, NVLink-Switch verification, and cost-realism vs cloud rental.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)
qwen-3.5-235b-a17b
Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and a dramatic concurrency lift via SGLang RadixAttention.
4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)
deepseek-v4-flash
DeepSeek V4 Flash is the throughput-tuned V4 sibling. 80B/12B-active on 4× H100 should produce the strongest open-weight tok/s in 2026.
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.