stepfun · 1000B parameters · Restricted · Reviewed May 2026

Step-3

StepFun's 1T-parameter MoE, 38B active per token. One of the largest open-weight models; cluster-only at any quant. Restricted license.

License: Step License · Released Sep 30, 2025 · Context: 65,536 tokens


How to run it

Step-3 is a 1T-parameter MoE (38B active per token, per StepFun) with no consumer path. The realistic target is an 8× H100 SXM node at INT4 via vLLM with tensor-parallel=8; if vLLM's MoE routing support is immature, fall back to SGLang with --tp 8. INT4 weights run ~565 GB on disk, so 8× 80GB (640 GB total) is the floor, with enough headroom left for 16K context. FP8 weights are ~1 TB and need two nodes (16× H100). Expected throughput: 15-30 tok/s per user at INT4 on 8× H100 (estimate — validation is thin). No viable single-GPU path. No viable Apple Silicon path: Q2 weights (~250-300 GB) won't fit a 192 GB Mac Studio at all, and even the 512 GB M3 Ultra configuration would crawl at 2-6 tok/s — academic. Verify StepFun's license and weight availability before allocating cluster time.
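
A minimal bring-up sketch with vLLM's Python API, under the assumptions above — the stepfun-ai/Step-3 repo id, AWQ weights for this architecture, and vLLM actually recognizing StepFun's custom MoE are all unconfirmed:

```python
# Minimal vLLM bring-up for Step-3 on one 8-GPU node. Everything here
# assumes vLLM recognizes the architecture -- validate outputs before
# trusting anything (see "What breaks first").
from vllm import LLM, SamplingParams

llm = LLM(
    model="stepfun-ai/Step-3",     # assumed repo id -- verify it exists first
    tensor_parallel_size=8,        # one 8x H100/A100 80GB node
    quantization="awq",            # INT4 weights, ~565 GB on disk
    max_model_len=4096,            # keep the KV cache small on first bring-up
    trust_remote_code=True,        # custom MoE needs the repo's modeling code
)

out = llm.generate(
    ["Reply with the single word: ready."],
    SamplingParams(temperature=0.0, max_tokens=8),
)
print(out[0].outputs[0].text)
```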

Hardware guidance

Minimum: 8× A100 80GB at INT4 (speculative — Step-3 tooling is unvalidated). Recommended: 8× H100 SXM at INT4, or 16× H100 across two nodes for FP8. VRAM math: MoE with ~1T total parameters and ~38B active per token — but the active count doesn't shrink the footprint, since every expert must stay resident. INT4 full weights are ~565 GB on disk; KV cache at 16K context adds ~15-25 GB per replica, so 8× H100 (640 GB total) covers INT4 weights plus KV cache at batch=1. FP8 weights are ~1 TB and exceed any single 8-GPU node. RTX 6000 Ada 48GB is insufficient per card for tensor-parallel splits at this scale. Mac Studio M3 Ultra 512 GB at Q2 is the only consumer-adjacent path (2-6 tok/s expected) but untested; the 192 GB configuration can't even hold Q2 weights. Cloud: RunPod/Lambda H100 cluster at $25-40/hr/node.
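
The arithmetic behind those numbers, as a sketch — the layer/head counts below are placeholders, since Step-3's exact hyperparameters are unconfirmed; swap in the values from its config.json:

```python
# Back-of-envelope VRAM estimate for a ~1T MoE such as Step-3.
# Layer/head counts are PLACEHOLDERS -- read the real values from the
# model's config.json; only the arithmetic is the point here.
GiB = 1024**3

total_params    = 1.0e12    # all experts resident; active count is irrelevant
bits_per_weight = 4.5       # INT4 weights + scales/zeros land near 4.5 b/w

weights = total_params * bits_per_weight / 8
print(f"weights: {weights / GiB:,.0f} GiB")                # ~524 GiB

# KV cache = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/elem
layers, kv_heads, head_dim = 80, 32, 128                   # placeholders
context_len, kv_bytes = 16_384, 2                          # 16K context, fp16 KV
kv = 2 * layers * kv_heads * head_dim * context_len * kv_bytes
print(f"KV cache @16K, batch=1: {kv / GiB:,.1f} GiB")      # ~20 GiB

node = 8 * 80 * GiB                                        # 8x 80GB node
print("fits on 8x 80GB:", weights + kv < node * 0.9)       # ~10% runtime overhead
```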

What breaks first

  1. vLLM MoE routing: Step-3 uses StepFun's custom MoE architecture. vLLM's generic MoE kernels may not fuse correctly, causing silent correctness failures or NaN outputs. Validate against known reference outputs before trusting results — see the sketch after this list.
  2. Tensor-parallel communication: across 8-16 GPUs, and especially across nodes, NCCL ring latency becomes dominant. MFU below 30% is common on clusters without NVLink/NVSwitch.
  3. Weight availability: as of mid-2026, Step-3 weights may not be publicly downloadable. Verify the Hugging Face repo exists before provisioning compute.
  4. Quantization toolchain gap: llama.cpp may not support Step-3's architecture — GGUF quantization depends on architecture-specific kernels. Expect 2-4 weeks of engineering to add support if missing.
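
The validation sketch referenced in item 1 — greedy-decode fixed prompts and compare against references you generated on a trusted stack (none exist publicly for Step-3, so the pairs below are placeholders):

```python
# Sanity-check a freshly loaded MoE before benchmarking: greedy-decode
# fixed prompts and compare to reference outputs generated on a trusted
# stack. NaN logprobs are the classic symptom of broken MoE kernel fusion.
import math
from vllm import LLM, SamplingParams

llm = LLM(model="stepfun-ai/Step-3", tensor_parallel_size=8,
          quantization="awq", trust_remote_code=True)   # assumed repo id

greedy = SamplingParams(temperature=0.0, max_tokens=32, logprobs=1)

references = {          # PLACEHOLDERS -- fill from a known-good run
    "2 + 2 =": " 4",
}

for prompt, expected in references.items():
    out = llm.generate([prompt], greedy)[0].outputs[0]
    has_nan = any(math.isnan(lp.logprob)
                  for step in out.logprobs for lp in step.values())
    print(f"{prompt!r}: nan={has_nan} match={out.text.startswith(expected)}")
```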

Runtime recommendation

Best path today: vLLM with tensor-parallel=8 on an 8× H100 node. If vLLM's MoE routing fails, SGLang is the fallback — vLLM takes --tensor-parallel-size, SGLang takes --tp. Avoid Ollama and llama.cpp unless Step-3 architecture support is confirmed; a quick registry check is sketched below. Avoid MLX-LM — Apple Silicon is not viable for this model size at useful throughput.
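
The registry check mentioned above, assuming the stepfun-ai/Step-3 repo id and a recent vLLM build:

```python
# Confirm this vLLM build knows Step-3's architecture before renting GPUs.
# The repo id is an assumption; the check itself is generic.
from transformers import AutoConfig
from vllm import ModelRegistry

cfg = AutoConfig.from_pretrained("stepfun-ai/Step-3", trust_remote_code=True)
supported = set(ModelRegistry.get_supported_archs())

for arch in (cfg.architectures or ["<none declared>"]):
    verdict = "supported" if arch in supported else "NOT supported"
    print(f"{arch}: {verdict} by this vLLM build")
```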

Common beginner mistakes

  • Mistake: assuming ollama pull step-3 works. Fix: check Ollama's supported model list first — Step-3 likely isn't added yet. Use vLLM or SGLang.
  • Mistake: renting a single H100 and expecting the model to load. Fix: ~565 GB of INT4 weights need at least 8× 80GB GPUs; a single H100 has 80 GB. Do the VRAM math before renting.
  • Mistake: trusting benchmark scores without independent validation. Fix: Step-3 has minimal third-party eval data as of mid-2026. Run your own benchmarks on your hardware before committing to production.
  • Mistake: assuming an Apache/MIT license. Fix: Step-3 ships under StepFun's restrictive Step License, and exact commercial terms are unconfirmed. Verify them before any production deployment — the lookup sketched below covers repo availability and the declared license tag.
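
The repo and license lookup referenced in the last item — a minimal sketch, assuming the stepfun-ai/Step-3 repo id:

```python
# Verify the repo exists and surface its declared license tag before
# provisioning compute. Repo id is an assumption.
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

repo = "stepfun-ai/Step-3"
try:
    info = HfApi().model_info(repo)
except RepositoryNotFoundError:
    raise SystemExit(f"{repo} not found -- do not rent the cluster yet")
except GatedRepoError:
    raise SystemExit(f"{repo} is gated -- request access before provisioning")

license_tag = next((t for t in (info.tags or []) if t.startswith("license:")),
                   "license tag: not declared")
print(f"repo OK · {license_tag}")
print("Read the full license text on the model card before commercial use.")
```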

Strengths

  • Frontier scale
  • Strong multilingual performance

Weaknesses

  • Multi-machine cluster only
  • Restricted license

Quantization variants

Each quantization trades model quality for file size and VRAM. Only AWQ-INT4 is listed for Step-3 so far; GGUF quants such as Q4_K_M depend on llama.cpp architecture support, which is unconfirmed.

Quantization | File size | VRAM required
AWQ-INT4     | 565.0 GB  | 640 GB
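
As a cross-check on the table, file size is just parameter count times effective bits per weight; the ~4.5 bits/weight for AWQ-INT4 (4-bit weights plus scales and zero points) is an assumption:

```python
# Cross-check the AWQ-INT4 file size: total params x effective bits / 8.
# 4.5 bits/weight is an assumed average including quantization metadata.
params = 1.0e12                     # ~1T total parameters
size_gb = params * 4.5 / 8 / 1e9    # decimal GB, matching the table
print(f"expected AWQ-INT4 file size: ~{size_gb:.0f} GB")  # ~563 GB vs 565.0 GB listed
```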

Get the model

Hugging Face — original weights

huggingface.co/stepfun-ai/Step-3

Source repository — no prequantized downloads; you quantize directly from these weights.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Step-3.

NVIDIA GB200 NVL72
13,824 GB · NVIDIA

Frequently asked

What's the minimum VRAM to run Step-3?

640 GB of VRAM is enough to run Step-3 at the AWQ-INT4 quantization (file size 565.0 GB). Higher-quality quantizations need more.

Can I use Step-3 commercially?

Step-3 is released under the Step License, which has restrictions for commercial use. Review the license terms before using it in a product.

What's the context length of Step-3?

Step-3 supports a context window of 65,536 tokens (64K).

Source: huggingface.co/stepfun-ai/Step-3

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
