
What runs on a vLLM tensor-parallel 4× H100 80GB workstation?

Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.

At a glance
  • Effective VRAM: 300 / 320 GB (not pooled)
  • Speed penalty: ~5% vs ideal single-card
  • Recommended runtime: vllm (tensor parallel)
  • Setup difficulty: expert (~2800W peak)
  • Models that fit: 24 · Borderline: 5 · Not practical: 7
Deployment recipe
4× H100 SXM tensor-parallel workstation →

DGX-class deployment recipe with vLLM TP-4, FP8 transformer engine, NVLink-Switch verification, and cost-realism vs cloud rental.
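
For reference, a minimal launch matching the recipe's shape, using vLLM's offline Python API. The model ID, context length, and memory fraction are illustrative placeholders, not the recipe's exact values; the recipe page has the production command.

```python
# Sketch of a TP-4 FP8 deployment on 4x H100. Placeholder model ID;
# substitute your target from the lists below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",  # assumption: any model from the "fits" list
    tensor_parallel_size=4,             # shard weights across all 4 H100s over NVLink
    quantization="fp8",                 # FP8 weights via the Hopper transformer engine
    max_model_len=32768,                # matches the 32K-context overhead figure below
    gpu_memory_utilization=0.90,        # leave headroom for activations + KV cache
)

out = llm.generate(
    ["Summarize tensor parallelism in two sentences."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```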

Memory budget
  • Total VRAM: 320 GB
  • Effective for inference: 300 GB (94% of total)
  • Not pooled

4× H100 80GB SXM with NVLink-Switch fabric is the rare configuration where total VRAM ≈ effective VRAM. The NVLink-Switch (DGX-H100 chassis) provides full-mesh 900 GB/s bidirectional bandwidth between all 4 cards, allowing tensor parallelism with negligible cross-card overhead. The effective ceiling for inference is ~300 GB — total minus ~5 GB per card for activations, KV cache, and runtime overhead at 32K context. This is the configuration where Qwen 3.5 235B-A17B at FP8 fits, and DeepSeek V4 Flash at Q4_K_M runs with room to spare.

Why total VRAM is not the whole story

NVLink-Switch fabric (~900 GB/s mesh) makes tensor-parallel cross-card overhead near zero. Effective capacity is ~300 GB of the 320 GB total after activations and KV cache.

See the multi-GPU guide for the full math + tradeoffs.
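
The fit arithmetic behind the lists below reduces to a few lines. A minimal sketch, assuming this page's published numbers (~5 GB per-card overhead at 32K context, ≤85% of effective VRAM as the comfort threshold); the function names are ours, not a site API.

```python
# Reproduce this page's fit verdicts from total VRAM and weight size.
PER_CARD_OVERHEAD_GB = 5  # activations + KV cache + runtime at 32K context

def effective_vram_gb(cards: int = 4, vram_per_card_gb: int = 80) -> int:
    """Effective inference ceiling: total VRAM minus per-card overhead."""
    return cards * (vram_per_card_gb - PER_CARD_OVERHEAD_GB)  # 4x80 -> 300 GB

def verdict(weights_gb: float) -> str:
    util = weights_gb / effective_vram_gb()
    if util <= 0.85:
        return f"fits ({util:.0%} of effective VRAM)"
    if util <= 1.00:
        return f"borderline ({util:.0%}), verify KV cache budget"
    return f"not practical ({util:.0%})"

print(verdict(192))  # DeepSeek V4 Flash, Q4_K_M -> fits (64%)
print(verdict(280))  # Llama 4 Maverick, Q4_K_M -> borderline (93%)
print(verdict(420))  # DeepSeek R1, Q4_K_M -> not practical (140%)
```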

Topology
  • Topology: single-node-multi-gpu
  • Interconnect: nvlink-switch (~900 GB/s)
  • Component count: 4 units
  • Components: 4× nvidia-h100-sxm
  • Recommended runtime: vllm (also: sglang, tensorrt-llm)
  • Recommended split strategy: tensor-parallel (also: expert-routing)
  • Setup difficulty: expert (~2800W peak)
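
One setup step worth a page-level note: verify the fabric really is NVLink-Switch and not a silent PCIe fallback. `nvidia-smi topo -m` is the quick CLI check; the sketch below does the equivalent from Python with pynvml (pip install nvidia-ml-py). The NVML calls are standard; treat the expected link count as an assumption to confirm against your chassis docs.

```python
# Count active NVLink links per GPU; zero on any card means the
# tensor-parallel all-reduce is falling back to PCIe.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        active = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link):
                    active += 1
            except pynvml.NVMLError:
                break  # link index not present on this GPU
        print(f"GPU {i}: {active} NVLink links active")
finally:
    pynvml.nvmlShutdown()
```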

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.

DeepSeek V4 Flash (284B MoE)
Fits
284B·Q4_K_M → 192 GB·64% of effective VRAM·~5% speed penalty vs ideal
Llama 3.1 Nemotron Ultra 253B
Fits
253B·Q4_K_M → 160 GB·53% of effective VRAM·~5% speed penalty vs ideal
DeepSeek Coder V2 236B
Fits
236B·Q4_K_M → 160 GB·53% of effective VRAM·~5% speed penalty vs ideal
DeepSeek V2.5 236B
Fits
236B·Q4_K_M → 160 GB·53% of effective VRAM·~5% speed penalty vs ideal
Qwen 3 235B-A22B
Fits
235B·Q4_K_M → 160 GB·53% of effective VRAM·~5% speed penalty vs ideal
GLM-5
Fits
200B·Q4_K_M → 140 GB·47% of effective VRAM·~5% speed penalty vs ideal
Kimi K1.5
Fits
200B·AWQ-INT4 → 140 GB·47% of effective VRAM·~5% speed penalty vs ideal
GLM-5 Pro
Fits
144B·AWQ-INT4 → 96 GB·32% of effective VRAM·~5% speed penalty vs ideal
Mixtral 8x22B Instruct
Fits
141B·Q4_K_M → 96 GB·32% of effective VRAM·~5% speed penalty vs ideal
WizardLM-2 8x22B
Fits
141B·Q4_K_M → 96 GB·32% of effective VRAM·~5% speed penalty vs ideal
DBRX Base
Fits
132B·Q4_K_M → 96 GB·32% of effective VRAM·~5% speed penalty vs ideal
DBRX Instruct
Fits
132B·AWQ-INT4 → 96 GB·32% of effective VRAM·~5% speed penalty vs ideal
Mistral Large 2 (123B)
Fits
123B·Q4_K_M → 88 GB·29% of effective VRAM·~5% speed penalty vs ideal
Nemotron 3 Super (120B-A12B)
Fits
120B·Q4_K_M → 84 GB·28% of effective VRAM·~5% speed penalty vs ideal
Llama 4 Scout
Fits
109B·Q4_K_M → 80 GB·27% of effective VRAM·~5% speed penalty vs ideal
Command R+ 104B
Fits
104B·Q4_K_M → 70 GB·23% of effective VRAM·~5% speed penalty vs ideal
Command R+ (Aug 2024)
Fits
104B·AWQ-INT4 → 72 GB·24% of effective VRAM·~5% speed penalty vs ideal
Llama 3.2 90B Vision
Fits
90B·AWQ-INT4 → 64 GB·21% of effective VRAM·~5% speed penalty vs ideal
Llama 3.2 90B Vision Instruct
Fits
90B·Q4_K_M → 60 GB·20% of effective VRAM·~5% speed penalty vs ideal
InternVL 2.5 78B
Fits
78B·Q4_K_M → 52 GB·17% of effective VRAM·~5% speed penalty vs ideal
Qwen 2.5 72B Instruct
Fits
72B·Q4_K_M → 48 GB·16% of effective VRAM·~5% speed penalty vs ideal
Molmo 72B
Fits
72B·Q4_K_M → 48 GB·16% of effective VRAM·~5% speed penalty vs ideal
Qwen 2.5-VL 72B
Fits
72B·AWQ-INT4 → 48 GB·16% of effective VRAM·~5% speed penalty vs ideal
Qwen 3 72B
Fits
72B·AWQ-INT4 → 48 GB·16% of effective VRAM·~5% speed penalty vs ideal

Borderline (5)

Fits but with little headroom. KV cache for long context may not fit; verify before deployment (a quick estimator sketch follows this list).

Llama 4 405B
Borderline
405B·AWQ-INT4 → 280 GB·93% of effective VRAM·~5% speed penalty vs ideal

Effective VRAM utilization >93% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Llama 4 Maverick
Borderline
400B·Q4_K_M → 280 GB·93% of effective VRAM·~5% speed penalty vs ideal

Effective VRAM utilization >93% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Jamba 1.5 Large
Borderline
398B·Q4_K_M → 260 GB·87% of effective VRAM·~5% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

Qwen 3.5 235B-A17B (MoE)
Borderline
235B·FP8 → 256 GB·85% of effective VRAM·~5% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

Hunyuan Large 389B MoE
Borderline
389B·Q4_K_M → 260 GB·87% of effective VRAM·~5% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.
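
The KV-cache check is mechanical once you have the model's shape. A back-of-envelope sketch using the standard formula (2 tensors, K and V, per layer × KV heads × head dim × dtype bytes, per token); the layer and head counts below are hypothetical placeholders, so read the real values from the model's config.json before trusting the result.

```python
# Estimate KV-cache footprint for a target context window.
def kv_cache_gb(context_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, dtype_bytes: int = 2, batch: int = 1) -> float:
    # K and V tensors per layer, per token, per sequence
    total = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * context_len * batch
    return total / 1024**3

# Hypothetical 123B-class dense model: 88 layers, 8 KV heads (GQA), 128-dim heads
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens ~ {kv_cache_gb(ctx, 88, 8, 128):.1f} GB KV cache")
```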

Not practical (7)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE)
Not practical
1600B·Q4_K_M → 1024 GB·341% of effective VRAM·~5% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Kimi K2.6
Not practical
1000B·Q4_K_M → 700 GB·233% of effective VRAM·~5% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Step-3
Not practical
1000B·AWQ-INT4 → 640 GB·213% of effective VRAM·~5% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V4
Not practical
745B·AWQ-INT4 → 480 GB·160% of effective VRAM·~5% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Mistral Medium 3.5 (675B MoE)
Not practical
675B·Q4_K_M → 448 GB·149% of effective VRAM·~5% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek R1 (671B reasoning)
Not practical
671B·Q4_K_M → 420 GB·140% of effective VRAM·~5% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V3 (671B MoE)
Not practical
671B·Q4_K_M → 420 GB·140% of effective VRAM·~5% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Benchmark opportunities

estimates, not measurements

Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks; a single-stream measurement sketch follows the list.

4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)
pending
Estimate: 60-90 tok/s decode (single stream)

Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and dramatic concurrency lift via SGLang RadixAttention.

4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)
pending
Estimate: 100-160 tok/s decode (single stream)

DeepSeek V4 Flash is the throughput-tuned V4 sibling. 284B-total/12B-active on 4× H100 should produce the strongest open-weight tok/s of 2026.
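
Until measured numbers land, a single-stream decode reading is easy to take against a running vLLM server through the OpenAI-compatible endpoint it exposes. A sketch, assuming the TP-4 server from the deployment sketch above is live on localhost:8000; counting stream chunks only approximates token count, and the model name must match whatever the server is serving.

```python
# Rough single-stream decode throughput against a local vLLM server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start, tokens = time.perf_counter(), 0
stream = client.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",  # assumption: must match the served model
    prompt="Write a 500-word overview of tensor parallelism.",
    max_tokens=512,
    stream=True,
)
for _chunk in stream:
    tokens += 1  # vLLM typically streams one token per chunk
elapsed = time.perf_counter() - start
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s decode")
```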

Going deeper

  • Full combo detail page — operational review with failure modes and runtime matrix.
  • Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
  • Will-it-run home — single-card check + custom builds.