stepfun · 1000B parameters · Restricted · Reviewed May 2026

Step-3

StepFun's 1T-parameter MoE, 38B active per token. One of the largest open-weight models; cluster-only at any quant. Restricted license.

License: Step License · Released Sep 30, 2025 · Context: 65,536 tokens


How to run it

Step-3 is a 1T-parameter MoE (38B active per token, per StepFun) with no consumer path. The realistic target is an 8× H100 SXM node at INT4 via vLLM with tensor-parallel=8; if vLLM's MoE routing support is immature, fall back to SGLang with --tp 8. INT4 weights run ~565 GB on disk, so 8× 80GB (640 GB total) is the floor, with enough headroom left for 16K context. FP8 weights are ~1 TB and need two nodes (16× H100). Expected throughput: 15-30 tok/s per user at INT4 on 8× H100 (estimate — validation is thin). No viable single-GPU path. No viable Apple Silicon path: Q2 weights (~250-300 GB) won't fit a 192 GB Mac Studio at all, and even the 512 GB M3 Ultra configuration would crawl at 2-6 tok/s — academic. Verify StepFun's license and weight availability before allocating cluster time.
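
A minimal bring-up sketch with vLLM's Python API, under the assumptions above — the stepfun-ai/Step-3 repo id, AWQ weights for this architecture, and vLLM actually recognizing StepFun's custom MoE are all unconfirmed:

```python
# Minimal vLLM bring-up for Step-3 on one 8-GPU node. Everything here
# assumes vLLM recognizes the architecture -- validate outputs before
# trusting anything (see "What breaks first").
from vllm import LLM, SamplingParams

llm = LLM(
    model="stepfun-ai/Step-3",     # assumed repo id -- verify it exists first
    tensor_parallel_size=8,        # one 8x H100/A100 80GB node
    quantization="awq",            # INT4 weights, ~565 GB on disk
    max_model_len=4096,            # keep the KV cache small on first bring-up
    trust_remote_code=True,        # custom MoE needs the repo's modeling code
)

out = llm.generate(
    ["Reply with the single word: ready."],
    SamplingParams(temperature=0.0, max_tokens=8),
)
print(out[0].outputs[0].text)
```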

Hardware guidance

Minimum: 8× A100 80GB at INT4 (speculative — Step-3 tooling is unvalidated). Recommended: 8× H100 SXM at INT4, or 16× H100 across two nodes for FP8. VRAM math: MoE with ~1T total parameters and ~38B active per token — but the active count doesn't shrink the footprint, since every expert must stay resident. INT4 full weights are ~565 GB on disk; KV cache at 16K context adds ~15-25 GB per replica, so 8× H100 (640 GB total) covers INT4 weights plus KV cache at batch=1. FP8 weights are ~1 TB and exceed any single 8-GPU node. RTX 6000 Ada 48GB is insufficient per card for tensor-parallel splits at this scale. Mac Studio M3 Ultra 512 GB at Q2 is the only consumer-adjacent path (2-6 tok/s expected) but untested; the 192 GB configuration can't even hold Q2 weights. Cloud: RunPod/Lambda H100 cluster at $25-40/hr/node.
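
The arithmetic behind those numbers, as a sketch — the layer/head counts below are placeholders, since Step-3's exact hyperparameters are unconfirmed; swap in the values from its config.json:

```python
# Back-of-envelope VRAM estimate for a ~1T MoE such as Step-3.
# Layer/head counts are PLACEHOLDERS -- read the real values from the
# model's config.json; only the arithmetic is the point here.
GiB = 1024**3

total_params    = 1.0e12    # all experts resident; active count is irrelevant
bits_per_weight = 4.5       # INT4 weights + scales/zeros land near 4.5 b/w

weights = total_params * bits_per_weight / 8
print(f"weights: {weights / GiB:,.0f} GiB")                # ~524 GiB

# KV cache = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/elem
layers, kv_heads, head_dim = 80, 32, 128                   # placeholders
context_len, kv_bytes = 16_384, 2                          # 16K context, fp16 KV
kv = 2 * layers * kv_heads * head_dim * context_len * kv_bytes
print(f"KV cache @16K, batch=1: {kv / GiB:,.1f} GiB")      # ~20 GiB

node = 8 * 80 * GiB                                        # 8x 80GB node
print("fits on 8x 80GB:", weights + kv < node * 0.9)       # ~10% runtime overhead
```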

What breaks first

  1. vLLM MoE routing: Step-3 uses StepFun's custom MoE architecture. vLLM's generic MoE kernels may not fuse correctly, causing silent correctness failures or NaN outputs. Validate against known reference outputs before trusting results — see the sketch after this list.
  2. Tensor-parallel communication: across 8-16 GPUs, and especially across nodes, NCCL ring latency becomes dominant. MFU below 30% is common on clusters without NVLink/NVSwitch.
  3. Weight availability: as of mid-2026, Step-3 weights may not be publicly downloadable. Verify the Hugging Face repo exists before provisioning compute.
  4. Quantization toolchain gap: llama.cpp may not support Step-3's architecture — GGUF quantization depends on architecture-specific kernels. Expect 2-4 weeks of engineering to add support if missing.
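
The validation sketch referenced in item 1 — greedy-decode fixed prompts and compare against references you generated on a trusted stack (none exist publicly for Step-3, so the pairs below are placeholders):

```python
# Sanity-check a freshly loaded MoE before benchmarking: greedy-decode
# fixed prompts and compare to reference outputs generated on a trusted
# stack. NaN logprobs are the classic symptom of broken MoE kernel fusion.
import math
from vllm import LLM, SamplingParams

llm = LLM(model="stepfun-ai/Step-3", tensor_parallel_size=8,
          quantization="awq", trust_remote_code=True)   # assumed repo id

greedy = SamplingParams(temperature=0.0, max_tokens=32, logprobs=1)

references = {          # PLACEHOLDERS -- fill from a known-good run
    "2 + 2 =": " 4",
}

for prompt, expected in references.items():
    out = llm.generate([prompt], greedy)[0].outputs[0]
    has_nan = any(math.isnan(lp.logprob)
                  for step in out.logprobs for lp in step.values())
    print(f"{prompt!r}: nan={has_nan} match={out.text.startswith(expected)}")
```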

Runtime recommendation

Best path today: vLLM with tensor-parallel=8 on an 8× H100 node. If vLLM's MoE routing fails, SGLang is the fallback — vLLM takes --tensor-parallel-size, SGLang takes --tp. Avoid Ollama and llama.cpp unless Step-3 architecture support is confirmed; a quick registry check is sketched below. Avoid MLX-LM — Apple Silicon is not viable for this model size at useful throughput.
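
The registry check mentioned above, assuming the stepfun-ai/Step-3 repo id and a recent vLLM build:

```python
# Confirm this vLLM build knows Step-3's architecture before renting GPUs.
# The repo id is an assumption; the check itself is generic.
from transformers import AutoConfig
from vllm import ModelRegistry

cfg = AutoConfig.from_pretrained("stepfun-ai/Step-3", trust_remote_code=True)
supported = set(ModelRegistry.get_supported_archs())

for arch in (cfg.architectures or ["<none declared>"]):
    verdict = "supported" if arch in supported else "NOT supported"
    print(f"{arch}: {verdict} by this vLLM build")
```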

Common beginner mistakes

  • Mistake: assuming ollama pull step-3 works. Fix: check Ollama's supported model list first — Step-3 likely isn't added yet. Use vLLM or SGLang.
  • Mistake: renting a single H100 and expecting the model to load. Fix: ~565 GB of INT4 weights need at least 8× 80GB GPUs; a single H100 has 80 GB. Do the VRAM math before renting.
  • Mistake: trusting benchmark scores without independent validation. Fix: Step-3 has minimal third-party eval data as of mid-2026. Run your own benchmarks on your hardware before committing to production.
  • Mistake: assuming an Apache/MIT license. Fix: Step-3 ships under StepFun's restrictive Step License, and exact commercial terms are unconfirmed. Verify them before any production deployment — the lookup sketched below covers repo availability and the declared license tag.
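
The repo and license lookup referenced in the last item — a minimal sketch, assuming the stepfun-ai/Step-3 repo id:

```python
# Verify the repo exists and surface its declared license tag before
# provisioning compute. Repo id is an assumption.
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

repo = "stepfun-ai/Step-3"
try:
    info = HfApi().model_info(repo)
except RepositoryNotFoundError:
    raise SystemExit(f"{repo} not found -- do not rent the cluster yet")
except GatedRepoError:
    raise SystemExit(f"{repo} is gated -- request access before provisioning")

license_tag = next((t for t in (info.tags or []) if t.startswith("license:")),
                   "license tag: not declared")
print(f"repo OK · {license_tag}")
print("Read the full license text on the model card before commercial use.")
```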

Strengths

  • Frontier scale
  • Strong multilingual performance

Weaknesses

  • Multi-machine cluster only
  • Restricted license

Quantization variants

Each quantization trades model quality for file size and VRAM. Only AWQ-INT4 is listed for Step-3 so far; GGUF quants such as Q4_K_M depend on llama.cpp architecture support, which is unconfirmed.

Quantization | File size | VRAM required
AWQ-INT4     | 565.0 GB  | 640 GB
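
As a cross-check on the table, file size is just parameter count times effective bits per weight; the ~4.5 bits/weight for AWQ-INT4 (4-bit weights plus scales and zero points) is an assumption:

```python
# Cross-check the AWQ-INT4 file size: total params x effective bits / 8.
# 4.5 bits/weight is an assumed average including quantization metadata.
params = 1.0e12                     # ~1T total parameters
size_gb = params * 4.5 / 8 / 1e9    # decimal GB, matching the table
print(f"expected AWQ-INT4 file size: ~{size_gb:.0f} GB")  # ~563 GB vs 565.0 GB listed
```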

Get the model

Hugging Face — original weights

huggingface.co/stepfun-ai/Step-3

Source repository — no prequantized downloads; you quantize directly from these weights.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Step-3.

NVIDIA GB200 NVL72
13,824 GB · NVIDIA

Frequently asked

What's the minimum VRAM to run Step-3?

640 GB of VRAM is enough to run Step-3 at the AWQ-INT4 quantization (file size 565.0 GB). Higher-quality quantizations need more.

Can I use Step-3 commercially?

Step-3 is released under the Step License, which has restrictions for commercial use. Review the license terms before using it in a product.

What's the context length of Step-3?

Step-3 supports a context window of 65,536 tokens (64K).

Source: huggingface.co/stepfun-ai/Step-3

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
