4× H100 SXM tensor-parallel workstation — frontier MoE serving reference
DGX-class 4× H100 80GB SXM with NVLink-Switch fabric. ~300 GB effective VRAM. vLLM tensor-parallel-4 + FP8 + MTP for frontier MoE production serving. The datacenter reference — overkill for hobby, the right answer for organizations.
What this stack accomplishes
4× H100 SXM5 with NVLink-Switch fabric is the datacenter reference for frontier-MoE serving in 2026. The use cases:
- Frontier-MoE production: Qwen 3.5 235B-A17B, DeepSeek V4 Pro, Llama 4 Maverick
- High-concurrency multi-tenant inference (16+ concurrent agent loops)
- Long-context (1M-token) workloads where KV-cache budget matters
- Research / training workloads
What it's NOT for:
- Hobby use — quad-3090 covers 90% of hobby workloads at 5% of the cost
- Single-user latency — TP-4 is slower per-stream than 2× TP-2 replicas
- Development iteration — TensorRT-LLM recompile-per-config friction is real
- Edge deployment — datacenter rack required
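The 1M-token KV-cache point above is worth quantifying. A rough sketch, using an assumed GQA configuration (60 layers, 8 KV heads, head dim 128, FP8 KV cache); the real models' dimensions will differ:

```python
# Rough KV-cache size estimate for long-context serving.
# All model dimensions here are illustrative assumptions, not the
# actual config of any model named in this guide.
def kv_cache_gib(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

print(f"1M-token sequence: {kv_cache_gib(1_000_000):.0f} GiB of KV cache")
print(f"128K-token sequence: {kv_cache_gib(131_072):.0f} GiB")
```

Even at FP8 KV precision, a single 1M-token sequence under these assumptions consumes over 100 GiB, which is why the KV-cache budget rather than the weight footprint bounds long-context concurrency on this build.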
Hardware required
DGX-H100 8-GPU chassis OR equivalent 4× H100 SXM build (Supermicro AS-4124GO-NART, ASUS ESC8000-E11) · 4× H100 80GB SXM5 with NVLink-Switch · 2TB DDR5 ECC · 8TB+ NVMe (model weights + checkpoints) · 25 GbE or InfiniBand for cluster expansion · datacenter rack with 30 kW power + tier-3 cooling · Ubuntu 22.04 LTS with NVIDIA enterprise driver stack
Components — what to install and why
- 01 · Hardware: GPUs (4× SXM5 in NVLink-Switch fabric) · nvidia-h100-sxm
H100 SXM5 with the NVLink-Switch chassis is the one configuration in this catalog where total VRAM ≈ effective VRAM: the ~900 GB/s per-GPU NVLink mesh between all 4 cards makes tensor-parallel-4 essentially free, versus the PCIe penalty consumer multi-GPU pays.
- 02 · Tool: Inference engine (TP-4 + FP8 + MTP) · vllm
vLLM is the production reference. --tensor-parallel-size 4 with FP8 quants exercises the H100's transformer engine; the multi-token-prediction (MTP) head for V4 Pro gives ~1.8× decode throughput. Set --gpu-memory-utilization 0.95; the SXM fabric leaves enough headroom that the higher reservation is safe.
- 03 · Tool: Agent serving (RadixAttention prefix cache) · sglang
SGLang's RadixAttention pays off more than vLLM's prefix caching at organizational concurrency: agent loops with stable system prompts see prefix-cache hit rates above 70%, multiplying effective throughput. Pick it over vLLM when serving 16+ concurrent agent harnesses.
- 04 · Tool: Peak-throughput runtime (for stable configs) · tensorrt-llm
TensorRT-LLM extracts an additional 15-25% throughput vs vLLM at the cost of recompile-per-config friction. Use when model + quant + batch size are stable for production deployment; not for development iteration.
- 05 · Model: Frontier MoE (235B / 17B active) · qwen-3.5-235b-a17b
Qwen 3.5 frontier MoE at FP8 fits comfortably in 4× H100 80GB. The strongest open-weight multilingual + reasoning model in 2026. Apache 2.0 successor to Qwen 3.
- 06 · Model: Frontier coder + reasoner (MIT license) · deepseek-v4-pro
DeepSeek V4 Pro at FP8 or AWQ-INT4 on 4× H100. The open-weight coding ceiling in 2026. The MIT license unblocks deployments that the Qwen license blocks.
- 07 · Model: Frontier multimodal MoE · llama-4-maverick
Llama 4 Maverick at AWQ-INT4 fits 4× H100 with multimodal headroom. Native vision-text reasoning + 1M context. Pick when multimodal serving is the requirement.
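A quick fit check for the model picks above. This is a back-of-envelope sketch: the parameter total for Maverick (~400B) and the bytes-per-weight figures are assumptions, and real quantized checkpoints carry extra overhead for scales and buffers.

```python
# Back-of-envelope "will the weights fit" check against 4x 80GB.
# Parameter counts and bytes-per-weight are idealized assumptions.
GIB = 1024**3
TOTAL_VRAM_GIB = 4 * 80                      # 4x H100 80GB

def weights_gib(params_billion, bytes_per_weight):
    return params_billion * 1e9 * bytes_per_weight / GIB

for name, params_b, bpw in [
    ("Qwen 3.5 235B @ FP8", 235, 1.0),       # FP8 ~1 byte/weight
    ("Llama 4 Maverick @ INT4", 400, 0.5),   # assumed ~400B total params
]:
    w = weights_gib(params_b, bpw)
    print(f"{name}: {w:.0f} GiB of {TOTAL_VRAM_GIB} GiB, "
          f"{TOTAL_VRAM_GIB - w:.0f} GiB left for KV cache")
```

In both cases the weights fit with roughly 100 GiB or more to spare, which is the headroom the KV cache and runtime overhead live in.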
Step-by-step setup
Assumes Ubuntu 22.04 LTS on a DGX-H100 or equivalent 4× H100 SXM5 chassis.
1. NVIDIA enterprise driver + CUDA toolkit
# Enterprise driver stack — pin precise version for production
sudo apt install -y nvidia-driver-560-server cuda-toolkit-12-6
sudo reboot
# Verify all 4 H100 SXM5 detected with NVLink-Switch
nvidia-smi --query-gpu=name,memory.total --format=csv
# Expected: 4× "NVIDIA H100 80GB HBM3", 81920 MiB each
# Verify NVLink-Switch topology (DGX-class)
nvidia-smi nvlink --status -i 0
# Expected: 18 active NVLink connections (full mesh)
nvidia-smi topo -m
# Expected: NV18 between every GPU pair (NVLink, 18 links)
2. Install vLLM + AutoAWQ + FP8 support
# Production-pinned versions
python3 -m venv /opt/venvs/vllm
source /opt/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm==0.7.3 autoawq
# Verify FP8 transformer engine availability
python -c "import torch; print(torch.cuda.get_device_capability(0))"
# Expected: (9, 0) — H100 is compute capability 9.0
# FP8 transformer engine requires sm_90+
3. Serve Qwen 3.5 235B-A17B at FP8
# Pre-pull the FP8 quant
huggingface-cli download neuralmagic/Qwen3.5-235B-A17B-FP8
# Serve with tensor-parallel-4
vllm serve neuralmagic/Qwen3.5-235B-A17B-FP8 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--quantization fp8 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 256 \
--port 8000
On H100 SXM with NVLink-Switch, --gpu-memory-utilization can safely run at 0.95 (vs 0.92 on consumer multi-GPU) — the high-bandwidth mesh keeps inter-rank communication fast enough that KV-cache spikes don't cause OOM.
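The 0.95 figure translates into a concrete KV-cache pool. An illustrative calculation, assuming an idealized ~1 byte/weight FP8 footprint for the 235B checkpoint (vLLM's real accounting also reserves activation and CUDA-graph memory):

```python
# Why --gpu-memory-utilization 0.95 still leaves a usable KV-cache pool.
# The weight figure is an idealized ~1 byte/weight FP8 estimate.
GIB = 1024**3
total_gib = 4 * 80                    # VRAM across the TP-4 group
usable_gib = total_gib * 0.95         # vLLM's reservation fraction
weights_gib = 235e9 * 1.0 / GIB       # ~235B params at ~1 byte each (FP8)
kv_pool_gib = usable_gib - weights_gib
print(f"usable {usable_gib:.0f} GiB - weights {weights_gib:.0f} GiB "
      f"= ~{kv_pool_gib:.0f} GiB for KV cache")
```

Dropping the utilization to a consumer-safe 0.92 would shave roughly 10 GiB off that pool, which is the concurrency headroom the NVLink fabric buys back.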
4. (Alternative) SGLang for high-concurrency agent serving
pip install sglang[all]
# Launch SGLang with RadixAttention prefix cache
python -m sglang.launch_server \
--model-path neuralmagic/Qwen3.5-235B-A17B-FP8 \
--tp 4 \
--quantization fp8 \
--port 30000
# Note: RadixAttention prefix caching is enabled by default in SGLang
Verifying NVLink-Switch fabric
DGX-class topology should show NV18 between all GPU pairs. Verify with:
nvidia-smi topo -m
# Expected matrix:
# GPU0 GPU1 GPU2 GPU3
# GPU0 X NV18 NV18 NV18
# GPU1 NV18 X NV18 NV18
# GPU2 NV18 NV18 X NV18
# GPU3 NV18 NV18 NV18 X
#
# NV18 = 18-link NVLink mesh (~900 GB/s aggregate per pair)
# If you see PIX/PXB instead, your chassis is PCIe-only — performance suffers
Expected outcome
vLLM TP-4 with FP8 transformer engine serving Qwen 3.5 235B-A17B at 80-150 tok/s single-stream decode and 800+ tok/s aggregate at 16 concurrent. NVLink-Switch fabric makes tensor-parallel cross-card overhead near-zero. Sustained power: 2.8 kW under load. Plan for $200,000+ all-in including chassis + interconnect.
Production benchmarks on properly-configured DGX-H100:
- Qwen 3.5 235B-A17B FP8 single-stream decode: 80-150 tok/s
- Aggregate at 16 concurrent: 800-1,500 tok/s
- TTFT (1K prompt): 80-120 ms
- DeepSeek V4 Pro AWQ-INT4 with MTP: 100-160 tok/s single-stream
- Power: 2,800W steady, 3,200W transient peak
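The aggregate benchmark converts into rough capacity planning. A sketch using the mid-range figure above and an assumed average load factor:

```python
# Converting the aggregate benchmark into daily output capacity.
# The throughput is the mid-range figure quoted above; the load
# factor is an assumption for illustration.
aggregate_tok_s = 1_000      # within the 800-1,500 tok/s range above
load_factor = 0.6            # assumed average utilization

tokens_per_day = aggregate_tok_s * load_factor * 86_400
print(f"~{tokens_per_day / 1e6:.0f}M output tokens/day at {load_factor:.0%} load")
```

Numbers like this are what you compare against per-token API pricing when deciding whether the box pays for itself.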
Power, cooling, and rack assumptions
- 30 kW rack power minimum. 4× H100 SXM5 at 700W each = 2.8 kW for GPUs alone; chassis fans + CPUs + memory bring sustained draw to ~3.5 kW.
- Tier-3 datacenter cooling required. Liquid cooling is standard for 24/7 production deployments.
- InfiniBand or 25 GbE for cluster expansion. If this is a single-node config, regular 10 GbE is fine for management; expansion to multi-node serving needs IB or 100 GbE.
- Datacenter floor space. DGX-H100 is 8U; standard 4×H100 SXM Supermicro is 4U. Plan rack space + adjacent equipment cooling.
Cost realism — should you own or rent?
The honest cost framing for May 2026:
| Path | Upfront cost | Ongoing cost (1 year) | Right when… |
|---|---|---|---|
| Cloud H100 rental (on-demand) | $0 | ~$3-5/GPU-hour × usage | You serve <6 hours/day or are still iterating |
| Cloud reserved (1-year commit) | $0 | $80-120k/year | You serve 16+ hours/day at predictable scale |
| Own DGX-H100 | $200k-300k | $15-25k/year power+cooling | You serve 24/7, need data sovereignty, or 3+ year horizon |
Most teams should rent. Owning makes sense when:
- Data-sovereignty requirements force on-prem deployment
- Inference volume sustained above $10-15k/month (rental break-even)
- Regulatory frameworks (HIPAA, defense, finance) make cloud unviable
- You have datacenter operations staff already; the marginal cost is small
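The rent-vs-own decision reduces to a break-even calculation. A sketch using midpoints of the ranges in the table above; all figures are illustrative, not quotes:

```python
# Own-vs-rent break-even sketch using the cost table's midpoints.
capex = 250_000                    # midpoint of $200k-300k system cost
opex_per_year = 20_000             # midpoint of power + cooling
cloud_rate = 4.0                   # assumed $/GPU-hour on-demand
gpus = 4
horizon_years = 3

own_total = capex + opex_per_year * horizon_years
hourly_cloud = cloud_rate * gpus   # $/hour for an equivalent 4-GPU node
break_even_hours = own_total / hourly_cloud
hours_per_day = break_even_hours / (horizon_years * 365)
print(f"Break-even: ~{break_even_hours:,.0f} node-hours "
      f"(~{hours_per_day:.1f} h/day over {horizon_years} years)")
```

At roughly 17-18 hours/day of sustained use over 3 years, this lines up with the table's "serve 16+ hours/day" threshold for committing to owned or reserved capacity.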
Failure modes you'll hit
- Cooling under-spec. SXM5 modules require chassis-integrated liquid or aggressive air cooling; an off-the-shelf chassis designed for PCIe H100 NVL cards does not meet SXM cooling requirements. Verify the thermal envelope before committing.
- CUDA / driver / vLLM version mismatch. H100 features (FP8 transformer engine, MIG partitioning) require precise stack alignment. Pin all versions in production config management.
- NVLink-Switch firmware bugs. Rare but real — switch fabric issues produce subtle cross-card corruption that's hard to diagnose. Stay on NVIDIA-validated firmware versions.
- MIG partition complexity. Multi-Instance GPU mode is powerful but complex; misconfiguration produces silent throughput loss. Don't enable MIG unless you specifically need partitioning.
- Power transient over-current. 4× 700W = 2,800W sustained; transients can hit 4,000W. PDU and UPS sizing is non-trivial — work with your facility ops.
- TP-4 single-stream stall. Counter-intuitively, 4-rank TP is slower per-stream than 2-rank because the all-reduce gets less efficient. For latency-critical workloads, run 2× TP-2 replicas instead.
- FP8 quant pipeline mismatches. Some FP8 quants from HuggingFace require specific TensorRT-LLM versions; verify before assuming a quant loads in vLLM.
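The TP-4 single-stream stall has a simple communication-model intuition. A toy latency-bandwidth (alpha-beta) sketch of a ring all-reduce; the constants are illustrative, not H100 measurements:

```python
# Toy alpha-beta model of a ring all-reduce, illustrating why more
# TP ranks hurt single-stream decode latency.
# alpha and bandwidth values are illustrative assumptions.
def allreduce_us(ranks, msg_bytes, alpha_us=5.0, bw_gb_s=450.0):
    """Ring all-reduce: 2(p-1) steps; each moves msg_bytes/p per rank."""
    steps = 2 * (ranks - 1)
    per_step_us = alpha_us + (msg_bytes / ranks) / (bw_gb_s * 1e3)
    return steps * per_step_us

msg = 16_384  # e.g. hidden size 8192 x 2 bytes: one token's activations
print(f"TP-2: {allreduce_us(2, msg):.1f} us, TP-4: {allreduce_us(4, msg):.1f} us")
```

At decode-time message sizes the fixed per-step latency dominates, and TP-4 pays 6 ring steps per all-reduce where TP-2 pays 2 — hence the advice to run 2× TP-2 replicas for latency-critical streams.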
Troubleshooting
Symptom: TP-4 throughput is lower than expected. Verify the NVLink-Switch fabric with nvidia-smi topo -m. PIX/PXB instead of NV18 means PCIe fallback — your chassis is missing the switch.
Symptom: random NCCL timeouts under load. Set NCCL_TIMEOUT=600 and increase the thread pool. NCCL on NVLink-Switch is normally rock-solid; timeouts indicate a firmware or driver mismatch.
Symptom: FP8 throughput same as BF16. The FP8 transformer engine isn't being used. Verify --quantization fp8 is set and CUDA capability reports 9.0.
Variations and alternatives
2× H100 SXM: half the cost, fits 70-100B models comfortably. Use when frontier-MoE isn't the target.
Cloud-burst hybrid: own 1× H100 for steady-state, rent 4× H100 cloud for peak. Lowers capital cost while preserving on-demand scale.
Sub-frontier alternatives:
- Quad RTX 3090 workstation — prosumer ceiling at 5% of the cost. Fits 100B-class but not frontier-MoE.
- Multi-machine Apple cluster — Mac-based path for 200B+ envelope at 5% the power budget.
- Distributed inference homelab — multi-node consumer-GPU pattern.
Who should avoid this build
- Individuals / small teams — H100 cloud is dramatically cheaper for sporadic workloads.
- Hobby projects — quad-3090 covers 90% of hobby use cases at 5% of the cost.
- CUDA-version-sensitive workloads — H100 requires CUDA 12+; older PyTorch / framework code may break.
- Anyone without datacenter operations — owning DGX-class hardware without ops staff is expensive and frustrating.
Going deeper
- H100 TP combo detail — operator-grade review.
- Will-it-run for this combo — fit verdict for every catalog model.
- Multi-GPU buying guide — full decision framework.
- Distributed inference systems — architectural depth.
- Quad RTX 3090 stack — the prosumer alternative at 5% of the cost.