Step-3
Overview
StepFun's 1T-parameter MoE. 38B active. One of the largest open-weight models; cluster-only at any quant. Restricted license.
How to run it
Step-3 is a ~1T-parameter MoE from StepFun with roughly 38B parameters active per token. There is no consumer path. The practical target is AWQ-INT4/Q4 (~565 GB of weights on disk) on 8× H100 SXM via vLLM with tensor-parallel=8; if vLLM's MoE routing support is immature, fall back to SGLang with --tp 8. The same ~640 GB of aggregate VRAM (8× A100 80GB or 8× H100 80GB) is the minimum even at 4K context; longer contexts (16K+) need extra headroom for KV cache. FP8 weights are roughly 1 TB and do not fit on a single 8-GPU node, so plan for two interconnected nodes if you want FP8 rather than INT4. Expected throughput: 15-30 tok/s per user at INT4 on 8× H100 (estimate; validation is thin). No viable single-GPU path. No viable Apple Silicon path: even Q2 weights (roughly 250-300 GB) exceed a 192 GB Mac Studio, and throughput would be academic even if they fit. Verify StepFun's license and weight availability before allocating cluster time.
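A minimal load sketch using vLLM's offline LLM API, assuming single-node INT4 on 8× 80 GB GPUs. The repo id (taken from the source link at the bottom of this page), AWQ support for this architecture, and the memory settings are unverified assumptions, not a confirmed recipe.

```python
# Sketch: loading Step-3 with vLLM's offline LLM API on one 8-GPU node at INT4.
# The repo id, AWQ support for this architecture, and the memory settings are
# assumptions; validate on a short rental before committing to a long one.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stepfun-ai/Step-3",       # or a published AWQ-INT4 repo, if one exists
    tensor_parallel_size=8,          # one shard per 80 GB GPU
    quantization="awq",              # drop this line if loading full-precision weights
    trust_remote_code=True,          # custom MoE architectures usually need this
    max_model_len=16384,             # keeps KV cache inside the budget under Hardware guidance
    gpu_memory_utilization=0.92,
)

out = llm.generate(
    ["Summarize the trade-offs of mixture-of-experts inference in two sentences."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(out[0].outputs[0].text)
```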
Hardware guidance
Minimum: 8× A100 80GB or 8× H100 80GB at AWQ-INT4/Q4 (speculative; Step-3 tooling is unvalidated). Recommended: 8× H100 SXM at INT4; FP8 needs roughly double that aggregate VRAM. VRAM math: an MoE with ~1T total parameters and ~38B active per token still has to keep every expert resident, so sparsity does not shrink the weight footprint. Q4/INT4 full weights are ~565 GB on disk. KV cache at 16K context adds ~15-25 GB per replica. 8× H100 (640 GB total) covers INT4 weights plus KV cache for batch=1; FP8 weights (~1 TB) exceed a single 8-GPU node. RTX 6000 Ada 48GB is insufficient per card for tensor-parallel splits of this size. There is no consumer-adjacent path: even Q2 weights (roughly 250-300 GB) exceed a 192 GB Mac Studio's unified memory. Cloud: RunPod/Lambda H100 clusters at $25-40/hr per node.
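The arithmetic above is easy to redo for other bit-widths. A back-of-envelope sketch, assuming ~1T total parameters and ignoring activation and fragmentation overhead:

```python
# Back-of-envelope VRAM sizing for a ~1T-parameter MoE. Illustrative only:
# total parameter count and bits-per-weight are assumptions, and real
# deployments need extra headroom for activations and fragmentation.
def weight_gb(total_params_b: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB for a given bit-width."""
    return total_params_b * 1e9 * bits_per_param / 8 / 1e9

TOTAL_PARAMS_B = 1000       # ~1T total; every expert must stay resident
KV_CACHE_GB = 20            # mid estimate for 16K context, batch=1
GPU_VRAM_GB = 80            # A100/H100 80GB class

for name, bits in [("FP8", 8.0), ("AWQ-INT4 / Q4", 4.5), ("Q2", 2.5)]:
    weights = weight_gb(TOTAL_PARAMS_B, bits)
    total = weights + KV_CACHE_GB
    gpus = -(-total // GPU_VRAM_GB)  # ceiling division
    print(f"{name:14s} weights ~{weights:5.0f} GB, with KV ~{total:5.0f} GB -> >= {int(gpus)} x 80GB GPUs")
```

At roughly 4.5 bits per weight this lands within a few GB of the 565 GB AWQ-INT4 figure in the quantization table below.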
What breaks first
1. vLLM MoE routing: Step-3 uses StepFun's custom MoE architecture. vLLM's generic MoE kernels may not fuse correctly, causing silent correctness failures or NaN outputs. Validate against known reference outputs before trusting results.
2. Tensor-parallel communication: across 8 GPUs, and especially across multiple nodes, NCCL ring latency becomes dominant. MFU below 30% is common on clusters without NVLink.
3. Weight availability: as of mid-2026, Step-3 weights may not be publicly downloadable. Verify that the Hugging Face repo exists before provisioning compute (see the sketch after this list).
4. Quantization toolchain gap: llama.cpp may not support Step-3's architecture, and GGUF quantization depends on architecture-specific kernels. Expect 2-4 weeks of engineering to add support if it's missing.
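A quick availability check before renting anything, sketched with huggingface_hub. The repo id comes from the source link at the bottom of this page; whether it is public, gated, or missing is exactly what this test tells you.

```python
# Confirm the weights are actually downloadable before provisioning a cluster.
# Sketch only: the repo id matches the source link below, but its availability
# and gating status are unconfirmed.
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

repo_id = "stepfun-ai/Step-3"

try:
    info = HfApi().model_info(repo_id, files_metadata=True)
except GatedRepoError:
    print(f"{repo_id}: gated; request access and read the license before renting GPUs")
except RepositoryNotFoundError:
    print(f"{repo_id}: not found (or private); do not provision compute yet")
else:
    total_gb = sum((f.size or 0) for f in info.siblings) / 1e9
    print(f"{repo_id}: {len(info.siblings)} files, ~{total_gb:.0f} GB of weights and assets")
```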
Runtime recommendation
Best path today: vLLM with tensor-parallel=8 on H100s. If vLLM's MoE routing fails, SGLang is the fallback; both take a tensor-parallel degree (--tensor-parallel-size in vLLM, --tp in SGLang) and both serve an OpenAI-compatible endpoint. Avoid Ollama and llama.cpp unless Step-3 architecture support is confirmed. Avoid MLX-LM; Apple Silicon is not viable for this model size at useful throughput.
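Because both servers expose an OpenAI-compatible HTTP endpoint, a correctness smoke test can stay the same whichever one you end up on. A minimal sketch, assuming the server is listening on localhost:8000 and was launched with the model name below:

```python
# Known-answer smoke test against a vLLM or SGLang OpenAI-compatible endpoint.
# Port, model name, and prompt are placeholders; adjust to your deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "stepfun-ai/Step-3",
        "messages": [{"role": "user", "content": "What is 17 * 24? Reply with the number only."}],
        "temperature": 0.0,
        "max_tokens": 16,
    },
    timeout=120,
)
resp.raise_for_status()
answer = resp.json()["choices"][0]["message"]["content"]
print(repr(answer))
# Broken MoE routing tends to surface as empty, repetitive, or nonsensical text
# rather than a clean error; run a handful of known-answer prompts (17 * 24 = 408)
# before trusting longer generations.
```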
Common beginner mistakes
- Mistake: Assuming ollama pull step-3 works. Fix: Check Ollama's supported model list first; Step-3 likely isn't added yet. Use vLLM or SGLang.
- Mistake: Renting a single H100 and expecting the model to load. Fix: ~565 GB of INT4 weights needs at least 8× 80GB GPUs, and a single H100 has 80 GB. Do the VRAM math before renting (see the sizing sketch under Hardware guidance).
- Mistake: Trusting benchmark scores without independent validation. Fix: Step-3 has minimal third-party eval data as of mid-2026. Run your own benchmarks on your hardware before committing to production; a minimal throughput check follows this list.
- Mistake: Assuming an Apache/MIT license. Fix: StepFun's license terms are unconfirmed. Verify commercial terms before any production deployment.
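A rough single-stream throughput check over the same OpenAI-compatible endpoint. The URL and model name are placeholders, and the tokens/sec math relies on the server reporting a usage block, which vLLM's and SGLang's OpenAI-compatible servers normally do for non-streaming requests.

```python
# Rough single-stream tokens/sec measurement via the OpenAI-compatible endpoint.
# Sketch only: endpoint URL and model name are placeholders, and the math relies
# on the server reporting a "usage" block in its response.
import time
import requests

payload = {
    "model": "stepfun-ai/Step-3",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in three sentences."}],
    "temperature": 0.0,
    "max_tokens": 256,
}

start = time.time()
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

completion_tokens = resp.json().get("usage", {}).get("completion_tokens")
if completion_tokens:
    print(f"{completion_tokens} tokens in {elapsed:.1f}s, about {completion_tokens / elapsed:.1f} tok/s single-stream")
else:
    print(f"Finished in {elapsed:.1f}s; server did not report token usage")
# Compare against the 15-30 tok/s estimate under "How to run it" before going to production.
```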
Strengths
- Frontier scale
- Strong multilingual performance
Weaknesses
- Multi-machine cluster only
- Restricted license
Quantization variants
Each quantization trades model quality for file size and VRAM. For Step-3, AWQ-INT4 is the only variant listed; the 640 GB VRAM figure maps to eight 80 GB GPUs. GGUF builds (Q4_K_M and similar) depend on llama.cpp adding architecture support, which is unconfirmed.
| Quantization | File size | VRAM required |
|---|---|---|
| AWQ-INT4 | 565.0 GB | 640 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
No single card has enough VRAM for any quantization of Step-3; the configurations under Hardware guidance are all multi-GPU.
Frequently asked
What's the minimum VRAM to run Step-3?
Roughly 640 GB in aggregate: eight 80 GB GPUs at AWQ-INT4/Q4 is the practical floor, and no single card comes close.
Can I use Step-3 commercially?
The license is restricted and its commercial terms are unconfirmed; verify with StepFun before any production deployment.
What's the context length of Step-3?
Not confirmed in this guide; check the model card on the source repository. The sizing examples here assume a 4K-16K working context.
Source: huggingface.co/stepfun-ai/Step-3
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Verify Step-3 runs on your specific hardware before committing money.