4× H100 SXM tensor-parallel workstation — frontier MoE serving reference
DGX-class 4× H100 80GB SXM with NVLink-Switch fabric. ~300 GB effective VRAM. vLLM tensor-parallel-4 + FP8 + MTP for frontier MoE production serving. The datacenter reference — overkill for hobby, the right answer for organizations.
What this stack accomplishes
4× H100 SXM5 with NVLink-Switch fabric is the datacenter reference for frontier-MoE serving in 2026. The use cases:
- Frontier-MoE production: Qwen 3.5 235B-A17B, DeepSeek V4 Pro, Llama 4 Maverick
- High-concurrency multi-tenant inference (16+ concurrent agent loops)
- Long-context (1M-token) workloads where KV-cache budget matters
- Research / training workloads
What it's NOT for:
- Hobby use — quad-3090 covers 90% of hobby workloads at 5% of the cost
- Single-user latency — TP-4 is slower per-stream than 2× TP-2 replicas
- Development iteration — TensorRT-LLM recompile-per-config friction is real
- Edge deployment — datacenter rack required
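The 1M-token KV-cache point above is worth quantifying. A rough sketch, using an assumed GQA configuration (60 layers, 8 KV heads, head dim 128, FP8 KV cache); the real models' dimensions will differ:

```python
# Rough KV-cache size estimate for long-context serving.
# All model dimensions here are illustrative assumptions, not the
# actual config of any model named in this guide.
def kv_cache_gib(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

print(f"1M-token sequence: {kv_cache_gib(1_000_000):.0f} GiB of KV cache")
print(f"128K-token sequence: {kv_cache_gib(131_072):.0f} GiB")
```

Even at FP8 KV precision, a single 1M-token sequence under these assumptions consumes over 100 GiB, which is why the KV-cache budget rather than the weight footprint bounds long-context concurrency on this build.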
Hardware required
DGX-H100 8-GPU chassis OR equivalent 4× H100 SXM build (Supermicro AS-4124GO-NART, ASUS ESC8000-E11) · 4× H100 80GB SXM5 with NVLink-Switch · 2TB DDR5 ECC · 8TB+ NVMe (model weights + checkpoints) · 25 GbE or InfiniBand for cluster expansion · datacenter rack with 30 kW power + tier-3 cooling · Ubuntu 22.04 LTS with NVIDIA enterprise driver stack
Components — what to install and why
- 01 · Hardware: GPUs (4× SXM5 in NVLink-Switch fabric) · nvidia-h100-sxm
H100 SXM5 with the NVLink-Switch chassis is the one configuration in this catalog where total VRAM ≈ effective VRAM: the ~900 GB/s per-GPU NVLink mesh between all 4 cards makes tensor-parallel-4 essentially free, versus the PCIe penalty consumer multi-GPU pays.
- 02 · Tool: Inference engine (TP-4 + FP8 + MTP) · vllm
vLLM is the production reference. --tensor-parallel-size 4 with FP8 quants exercises the H100's transformer engine; the multi-token-prediction (MTP) head for V4 Pro gives ~1.8× decode throughput. Set --gpu-memory-utilization 0.95; the SXM fabric leaves enough headroom that the higher reservation is safe.
- 03 · Tool: Agent serving (RadixAttention prefix cache) · sglang
SGLang's RadixAttention pays off more than vLLM's prefix caching at organizational concurrency: agent loops with stable system prompts see prefix-cache hit rates above 70%, multiplying effective throughput. Pick it over vLLM when serving 16+ concurrent agent harnesses.
- 04 · Tool: Peak-throughput runtime (for stable configs) · tensorrt-llm
TensorRT-LLM extracts an additional 15-25% throughput vs vLLM at the cost of recompile-per-config friction. Use when model + quant + batch size are stable for production deployment; not for development iteration.
- 05 · Model: Frontier MoE (235B / 17B active) · qwen-3.5-235b-a17b
Qwen 3.5 frontier MoE at FP8 fits comfortably in 4× H100 80GB. The strongest open-weight multilingual + reasoning model in 2026. Apache 2.0 successor to Qwen 3.
- 06 · Model: Frontier coder + reasoner (MIT license) · deepseek-v4-pro
DeepSeek V4 Pro at FP8 or AWQ-INT4 on 4× H100. The open-weight coding ceiling in 2026. The MIT license unblocks deployments that the Qwen license blocks.
- 07 · Model: Frontier multimodal MoE · llama-4-maverick
Llama 4 Maverick at AWQ-INT4 fits 4× H100 with multimodal headroom. Native vision-text reasoning + 1M context. Pick when multimodal serving is the requirement.
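A quick fit check for the model picks above. This is a back-of-envelope sketch: the parameter total for Maverick (~400B) and the bytes-per-weight figures are assumptions, and real quantized checkpoints carry extra overhead for scales and buffers.

```python
# Back-of-envelope "will the weights fit" check against 4x 80GB.
# Parameter counts and bytes-per-weight are idealized assumptions.
GIB = 1024**3
TOTAL_VRAM_GIB = 4 * 80                      # 4x H100 80GB

def weights_gib(params_billion, bytes_per_weight):
    return params_billion * 1e9 * bytes_per_weight / GIB

for name, params_b, bpw in [
    ("Qwen 3.5 235B @ FP8", 235, 1.0),       # FP8 ~1 byte/weight
    ("Llama 4 Maverick @ INT4", 400, 0.5),   # assumed ~400B total params
]:
    w = weights_gib(params_b, bpw)
    print(f"{name}: {w:.0f} GiB of {TOTAL_VRAM_GIB} GiB, "
          f"{TOTAL_VRAM_GIB - w:.0f} GiB left for KV cache")
```

In both cases the weights fit with roughly 100 GiB or more to spare, which is the headroom the KV cache and runtime overhead live in.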
Step-by-step setup
Assumes Ubuntu 22.04 LTS on a DGX-H100 or equivalent 4× H100 SXM5 chassis.
1. NVIDIA enterprise driver + CUDA toolkit
# Enterprise driver stack — pin precise version for production
sudo apt install -y nvidia-driver-560-server cuda-toolkit-12-6
sudo reboot
# Verify all 4 H100 SXM5 detected with NVLink-Switch
nvidia-smi --query-gpu=name,memory.total --format=csv
# Expected: 4× "NVIDIA H100 80GB HBM3", 81920 MiB each
# Verify NVLink-Switch topology (DGX-class)
nvidia-smi nvlink --status -i 0
# Expected: 18 active NVLink connections (full mesh)
nvidia-smi topo -m
# Expected: NV18 between every GPU pair (NVLink, 18 links)
2. Install vLLM + AutoAWQ + FP8 support
# Production-pinned versions
python3 -m venv /opt/venvs/vllm
source /opt/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm==0.7.3 autoawq
# Verify FP8 transformer engine availability
python -c "import torch; print(torch.cuda.get_device_capability(0))"
# Expected: (9, 0) — H100 is compute capability 9.0
# FP8 transformer engine requires sm_90+
3. Serve Qwen 3.5 235B-A17B at FP8
# Pre-pull the FP8 quant
huggingface-cli download neuralmagic/Qwen3.5-235B-A17B-FP8
# Serve with tensor-parallel-4
vllm serve neuralmagic/Qwen3.5-235B-A17B-FP8 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--quantization fp8 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 256 \
--port 8000
On H100 SXM with NVLink-Switch, --gpu-memory-utilization can safely run at 0.95 (vs 0.92 on consumer multi-GPU) — the high-bandwidth mesh keeps inter-rank communication fast enough that KV-cache spikes don't cause OOM.
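The 0.95 figure translates into a concrete KV-cache pool. An illustrative calculation, assuming an idealized ~1 byte/weight FP8 footprint for the 235B checkpoint (vLLM's real accounting also reserves activation and CUDA-graph memory):

```python
# Why --gpu-memory-utilization 0.95 still leaves a usable KV-cache pool.
# The weight figure is an idealized ~1 byte/weight FP8 estimate.
GIB = 1024**3
total_gib = 4 * 80                    # VRAM across the TP-4 group
usable_gib = total_gib * 0.95         # vLLM's reservation fraction
weights_gib = 235e9 * 1.0 / GIB       # ~235B params at ~1 byte each (FP8)
kv_pool_gib = usable_gib - weights_gib
print(f"usable {usable_gib:.0f} GiB - weights {weights_gib:.0f} GiB "
      f"= ~{kv_pool_gib:.0f} GiB for KV cache")
```

Dropping the utilization to a consumer-safe 0.92 would shave roughly 10 GiB off that pool, which is the concurrency headroom the NVLink fabric buys back.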
4. (Alternative) SGLang for high-concurrency agent serving
pip install sglang[all]
# Launch SGLang with RadixAttention prefix cache
python -m sglang.launch_server \
--model-path neuralmagic/Qwen3.5-235B-A17B-FP8 \
--tp 4 \
--quantization fp8 \
--port 30000
# Note: RadixAttention prefix caching is enabled by default in SGLang
Verifying NVLink-Switch fabric
DGX-class topology should show NV18 between all GPU pairs. Verify with:
nvidia-smi topo -m
# Expected matrix:
# GPU0 GPU1 GPU2 GPU3
# GPU0 X NV18 NV18 NV18
# GPU1 NV18 X NV18 NV18
# GPU2 NV18 NV18 X NV18
# GPU3 NV18 NV18 NV18 X
#
# NV18 = 18-link NVLink mesh (~900 GB/s aggregate per pair)
# If you see PIX/PXB instead, your chassis is PCIe-only — performance suffers
Expected outcome
vLLM TP-4 with FP8 transformer engine serving Qwen 3.5 235B-A17B at 80-150 tok/s single-stream decode and 800+ tok/s aggregate at 16 concurrent. NVLink-Switch fabric makes tensor-parallel cross-card overhead near-zero. Sustained power: 2.8 kW under load. Plan for $200,000+ all-in including chassis + interconnect.
Production benchmarks on properly-configured DGX-H100:
- Qwen 3.5 235B-A17B FP8 single-stream decode: 80-150 tok/s
- Aggregate at 16 concurrent: 800-1,500 tok/s
- TTFT (1K prompt): 80-120 ms
- DeepSeek V4 Pro AWQ-INT4 with MTP: 100-160 tok/s single-stream
- Power: 2,800W steady, 3,200W transient peak
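The aggregate benchmark converts into rough capacity planning. A sketch using the mid-range figure above and an assumed average load factor:

```python
# Converting the aggregate benchmark into daily output capacity.
# The throughput is the mid-range figure quoted above; the load
# factor is an assumption for illustration.
aggregate_tok_s = 1_000      # within the 800-1,500 tok/s range above
load_factor = 0.6            # assumed average utilization

tokens_per_day = aggregate_tok_s * load_factor * 86_400
print(f"~{tokens_per_day / 1e6:.0f}M output tokens/day at {load_factor:.0%} load")
```

Numbers like this are what you compare against per-token API pricing when deciding whether the box pays for itself.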
Power, cooling, and rack assumptions
- 30 kW rack power minimum. 4× H100 SXM5 at 700W each = 2.8 kW for GPUs alone; chassis fans + CPUs + memory bring sustained draw to ~3.5 kW.
- Tier-3 datacenter cooling required. Liquid cooling is standard for 24/7 production deployments.
- InfiniBand or 25 GbE for cluster expansion. If this is a single-node config, regular 10 GbE is fine for management; expansion to multi-node serving needs IB or 100 GbE.
- Datacenter floor space. DGX-H100 is 8U; standard 4×H100 SXM Supermicro is 4U. Plan rack space + adjacent equipment cooling.
Cost realism — should you own or rent?
The honest cost framing for May 2026:
| Path | Upfront cost | Ongoing cost (1 year) | Right when… |
|---|---|---|---|
| Cloud H100 rental (on-demand) | $0 | ~$3-5/GPU-hour × usage | You serve <6 hours/day or are still iterating |
| Cloud reserved (1-year commit) | $0 | $80-120k/year | You serve 16+ hours/day at predictable scale |
| Own DGX-H100 | $200k-300k | $15-25k/year power+cooling | You serve 24/7, need data sovereignty, or 3+ year horizon |
Most teams should rent. Owning makes sense when:
- Data-sovereignty requirements force on-prem deployment
- Inference volume sustained above $10-15k/month (rental break-even)
- Regulatory frameworks (HIPAA, defense, finance) make cloud unviable
- You have datacenter operations staff already; the marginal cost is small
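The rent-vs-own decision reduces to a break-even calculation. A sketch using midpoints of the ranges in the table above; all figures are illustrative, not quotes:

```python
# Own-vs-rent break-even sketch using the cost table's midpoints.
capex = 250_000                    # midpoint of $200k-300k system cost
opex_per_year = 20_000             # midpoint of power + cooling
cloud_rate = 4.0                   # assumed $/GPU-hour on-demand
gpus = 4
horizon_years = 3

own_total = capex + opex_per_year * horizon_years
hourly_cloud = cloud_rate * gpus   # $/hour for an equivalent 4-GPU node
break_even_hours = own_total / hourly_cloud
hours_per_day = break_even_hours / (horizon_years * 365)
print(f"Break-even: ~{break_even_hours:,.0f} node-hours "
      f"(~{hours_per_day:.1f} h/day over {horizon_years} years)")
```

At roughly 17-18 hours/day of sustained use over 3 years, this lines up with the table's "serve 16+ hours/day" threshold for committing to owned or reserved capacity.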
Failure modes you'll hit
- Cooling under-spec. SXM5 modules require chassis-integrated liquid or aggressive air cooling; an off-the-shelf chassis designed for PCIe H100 NVL cards does not meet SXM cooling requirements. Verify the thermal envelope before committing.
- CUDA / driver / vLLM version mismatch. H100 features (FP8 transformer engine, MIG partitioning) require precise stack alignment. Pin all versions in production config management.
- NVLink-Switch firmware bugs. Rare but real — switch fabric issues produce subtle cross-card corruption that's hard to diagnose. Stay on NVIDIA-validated firmware versions.
- MIG partition complexity. Multi-Instance GPU mode is powerful but complex; misconfiguration produces silent throughput loss. Don't enable MIG unless you specifically need partitioning.
- Power transient over-current. 4× 700W = 2,800W sustained; transients can hit 4,000W. PDU and UPS sizing is non-trivial — work with your facility ops.
- TP-4 single-stream stall. Counter-intuitively, 4-rank TP is slower per-stream than 2-rank because the all-reduce gets less efficient. For latency-critical workloads, run 2× TP-2 replicas instead.
- FP8 quant pipeline mismatches. Some FP8 quants from HuggingFace require specific TensorRT-LLM versions; verify before assuming a quant loads in vLLM.
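The TP-4 single-stream stall has a simple communication-model intuition. A toy latency-bandwidth (alpha-beta) sketch of a ring all-reduce; the constants are illustrative, not H100 measurements:

```python
# Toy alpha-beta model of a ring all-reduce, illustrating why more
# TP ranks hurt single-stream decode latency.
# alpha and bandwidth values are illustrative assumptions.
def allreduce_us(ranks, msg_bytes, alpha_us=5.0, bw_gb_s=450.0):
    """Ring all-reduce: 2(p-1) steps; each moves msg_bytes/p per rank."""
    steps = 2 * (ranks - 1)
    per_step_us = alpha_us + (msg_bytes / ranks) / (bw_gb_s * 1e3)
    return steps * per_step_us

msg = 16_384  # e.g. hidden size 8192 x 2 bytes: one token's activations
print(f"TP-2: {allreduce_us(2, msg):.1f} us, TP-4: {allreduce_us(4, msg):.1f} us")
```

At decode-time message sizes the fixed per-step latency dominates, and TP-4 pays 6 ring steps per all-reduce where TP-2 pays 2 — hence the advice to run 2× TP-2 replicas for latency-critical streams.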
Troubleshooting
Symptom: TP-4 throughput is lower than expected. Verify the NVLink-Switch fabric with nvidia-smi topo -m. PIX/PXB instead of NV18 means PCIe fallback — your chassis is missing the switch.
Symptom: random NCCL timeouts under load. Set NCCL_TIMEOUT=600 and increase the thread pool. NCCL on NVLink-Switch is normally rock-solid; timeouts indicate a firmware or driver mismatch.
Symptom: FP8 throughput same as BF16. The FP8 transformer engine isn't being used. Verify --quantization fp8 is set and CUDA capability reports 9.0.
Variations and alternatives
2× H100 SXM: half the cost, fits 70-100B models comfortably. Use when frontier-MoE isn't the target.
Cloud-burst hybrid: own 1× H100 for steady-state, rent 4× H100 cloud for peak. Lowers capital cost while preserving on-demand scale.
Sub-frontier alternatives:
- Quad RTX 3090 workstation — prosumer ceiling at 5% of the cost. Fits 100B-class but not frontier-MoE.
- Multi-machine Apple cluster — Mac-based path for 200B+ envelope at 5% the power budget.
- Distributed inference homelab — multi-node consumer-GPU pattern.
Who should avoid this build
- Individuals / small teams — H100 cloud is dramatically cheaper for sporadic workloads.
- Hobby projects — quad-3090 covers 90% of hobby use cases at 5% of the cost.
- CUDA-version-sensitive workloads — H100 requires CUDA 12+; older PyTorch / framework code may break.
- Anyone without datacenter operations — owning DGX-class hardware without ops staff is expensive and frustrating.
Going deeper
- H100 TP combo detail — operator-grade review.
- Will-it-run for this combo — fit verdict for every catalog model.
- Multi-GPU buying guide — full decision framework.
- Distributed inference systems — architectural depth.
- Quad RTX 3090 stack — the prosumer alternative at 5% of the cost.