Hunyuan Large 389B MoE
Overview
Tencent's frontier MoE. 389B total / 52B active. License permits commercial use with restrictions on companies above MAU thresholds.
How to run it
Hunyuan-Large is Tencent's 389B-parameter MoE model (~52B active per token). Run it at Q4_K_M via llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M GGUF is roughly 220 GB on disk: all 389B parameters must be stored even though only ~52B are computed per token. Practical minimum VRAM is about 96 GB, e.g. dual RTX A6000 (48 GB each) with tensor-split, paired with 256+ GB of system RAM so the expert tensors that don't fit in VRAM can be offloaded; a single A100 80GB works the same way, keeping shared layers and KV cache on the GPU and experts in RAM. For GPU-poor setups, CPU-only inference at Q4_K_M is viable on a server with 256+ GB RAM at roughly 3-6 tok/s on a high-core-count Xeon/Epyc. llama.cpp offloads experts to system RAM when VRAM is tight, at the cost of speed whenever routing hits a RAM-resident expert. For serving, use vLLM on 4× A100 80GB with tensor parallelism and INT4 (AWQ) quantization. Context: up to 256K tokens advertised; the practical usable range at Q4_K_M on 96 GB of VRAM is ~8-16K.
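A minimal sketch of that single-box invocation, assuming a recent llama.cpp build and a Q4_K_M GGUF you have produced yourself (the filename is a placeholder). The -ot/--override-tensor flag and the "exps" tensor-name pattern follow llama.cpp's MoE conventions; verify both against your build and the actual tensor names before relying on them.

```bash
# Hypothetical filename. -ngl offloads layers to GPU, -fa enables flash attention,
# -c sets context, -t sets CPU threads, -ot keeps matching tensors (expert FFNs) in RAM.
./llama-cli -m ./hunyuan-large-q4_k_m.gguf \
  -ngl 999 -fa -c 8192 -t 32 \
  -ot "ffn_.*_exps=CPU" \
  -p "Summarize the trade-offs of running a 389B MoE locally."
```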
Hardware guidance
Minimum: a single A100 80GB at Q4_K_M, with the ~220 GB of weights split so the attention/shared tensors and KV cache sit in VRAM and the expert tensors are offloaded to system RAM (tight but workable with 256+ GB RAM). Recommended: dual RTX A6000 (96 GB VRAM total) at Q4_K_M with row-split and 8-16K context, again backed by 256+ GB system RAM. Budget path: CPU-only on a 256 GB RAM server at Q4_K_M (~3-6 tok/s). VRAM math: 389B total parameters, ~52B active per token; the active subset at Q4_K_M is roughly 30 GB, while the inactive expert weights live in VRAM or RAM depending on the offload strategy. llama.cpp with KV offload and expert offload to RAM cuts the VRAM requirement but adds latency whenever routing lands on a RAM-resident expert. RTX 4090 24GB: Q3_K_M with aggressive expert offload to RAM. A Mac Studio with 256 GB+ unified memory (M3 Ultra) can run Q4_K_M at ~4-8 tok/s. Cloud: 2-4× A100 at roughly $16-30/hr on-demand, depending on provider.
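A back-of-envelope version of that memory math, assuming Q4_K_M averages about 4.8 bits per weight (published files vary):

```bash
# Shell arithmetic: size_GB ~= params_in_billions * bits_per_weight / 8.
# The 48/80 below encodes 4.8 bits divided by 8 bits-per-byte using integers only.
echo "all weights  : $(( 389 * 48 / 80 )) GB (must live somewhere: VRAM + system RAM)"
echo "active subset: $((  52 * 48 / 80 )) GB (what is actually computed per token)"
```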
What breaks first
1. Expert routing stall. When experts are offloaded to system RAM, a routing decision that hits a RAM-resident expert adds 50-200 ms of latency. At low batch sizes this shows up as visible stutter during generation. Keep as many experts in VRAM as possible (see the sketch after this list).
2. Chinese-language bias. Hunyuan-Large is Tencent's model and its training data is Chinese-heavy. English quality is competitive but nuanced prompts may show a Chinese-culture skew.
3. AWQ on MoE. AWQ-INT4 quantization of MoE architectures can degrade expert-routing stability more than it does dense models. Test routing correctness at 4-bit before deploying.
4. Tensor-split imbalance. llama.cpp row-split across mismatched GPUs (e.g., A6000 + RTX 3090) leaves the faster GPU idling while it waits for the slower one. Use identical GPU pairs.
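A hedged sketch of keeping only some experts in VRAM and spilling the rest to system RAM. The layer-range regex and the ffn_*_exps names follow llama.cpp's usual MoE tensor naming; confirm the real tensor names for this model (for example with the gguf-dump tool or the verbose load log) before relying on it.

```bash
# Experts of layers 0-29 stay in VRAM; experts of layers 30 and up spill to CPU/RAM,
# trading some per-token latency for fit. Everything else goes to GPU via -ngl 999.
./llama-cli -m ./hunyuan-large-q4_k_m.gguf -ngl 999 -fa -c 8192 \
  -ot "blk\.(3[0-9]|[4-9][0-9])\.ffn_.*_exps=CPU"
```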
Runtime recommendation
llama.cpp is the practical choice for single-box setups, because it can park expert tensors in system RAM and still use whatever VRAM you have. Reserve vLLM for multi-GPU serving where the whole model fits in aggregate VRAM; a hedged serving sketch follows.
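The serving sketch below assumes an INT4 (AWQ) checkpoint of the model exists or has been produced (the local path is a placeholder) and that your vLLM version lists the Hunyuan-Large architecture as supported; the flags themselves are standard vLLM options.

```bash
# 4x A100 80GB gives ~320 GB aggregate VRAM for ~200 GB of INT4 weights plus KV cache.
vllm serve /models/hunyuan-large-awq-int4 \
  --tensor-parallel-size 4 \
  --quantization awq \
  --max-model-len 16384
```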
Common beginner mistakes
- Mistake: assuming "52B active" means it fits in 32 GB of VRAM. Fix: all 389B parameters must be accessible (disk, RAM, or VRAM); 52B is what is computed per token, not what is stored. Budget roughly 220 GB of storage for Q4_K_M (the quick checks sketched after this list help here).
- Mistake: expecting consistent generation speed. Fix: expert routing means some tokens hit VRAM-resident experts (fast) and others hit RAM-resident experts (a 50-200 ms stall), so speed varies token to token.
- Mistake: using Q8 for the full MoE. Fix: Q8_0 for 389B is roughly 400 GB and needs 4-8× A100. Start at Q4_K_M.
- Mistake: ignoring license restrictions. Fix: Tencent's license for Hunyuan-Large restricts commercial use above a monthly-active-user threshold. Verify the current terms at huggingface.co/tencent/Tencent-Hunyuan-Large before production deployment.
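A few quick pre-flight checks that catch the first mistake before a long load attempt (the GGUF filename is a placeholder):

```bash
du -h ./hunyuan-large-q4_k_m.gguf                      # expect on the order of 220 GB
free -g                                                # want 256+ GB RAM if offloading experts
nvidia-smi --query-gpu=name,memory.total --format=csv  # confirm per-card VRAM
```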
Strengths
- Open-weight frontier MoE
- Strong on Chinese + English
Weaknesses
- Server-class hardware required (multi-GPU or 256 GB+ system RAM) even at 4-bit
- Tier-restricted commercial license
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | Memory required (VRAM + RAM) |
|---|---|---|
| Q4_K_M | ~220 GB | ~260 GB |
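The same bits-per-weight arithmetic extends to the other quants mentioned above, assuming rough averages of ~3.9 bits for Q3_K_M, ~4.8 for Q4_K_M, and ~8.5 for Q8_0; published files differ from this naive estimate because per-tensor precision varies.

```bash
# Rough GGUF size per quant: 389B params * bits-per-weight / 8, in GB (integer math).
for q in "Q3_K_M 39" "Q4_K_M 48" "Q8_0 85"; do
  set -- $q
  echo "$1: ~$(( 389 * $2 / 80 )) GB"
done
```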
Get the model
HuggingFace
Original weights
Source repository with the original weights; no prebuilt GGUF is assumed, so you quantize it yourself (a conversion sketch follows).
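A hedged sketch of producing that GGUF. The convert script and the llama-quantize binary are part of llama.cpp, but whether they support the Hunyuan-Large architecture depends on your llama.cpp version; check that before pulling roughly 780 GB of BF16 weights.

```bash
# Convert the Hugging Face checkpoint to an f16 GGUF, then quantize to Q4_K_M.
python convert_hf_to_gguf.py /path/to/Tencent-Hunyuan-Large \
  --outtype f16 --outfile hunyuan-large-f16.gguf
./llama-quantize hunyuan-large-f16.gguf hunyuan-large-q4_k_m.gguf Q4_K_M
```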
Hardware that runs this
Cards with enough VRAM for at least one quantization of Hunyuan Large 389B MoE.
Frequently asked
What's the minimum VRAM to run Hunyuan Large 389B MoE?
About 96 GB (e.g., dual RTX A6000) paired with 256+ GB of system RAM for expert offload; a single A100 80GB also works with experts held in RAM, and CPU-only inference is possible on a 256 GB RAM server.
Can I use Hunyuan Large 389B MoE commercially?
Generally yes, but Tencent's license restricts companies above a monthly-active-user threshold; check the license on the Hugging Face repository before production use.
What's the context length of Hunyuan Large 389B MoE?
Up to 256K tokens advertised, though the practical local range at Q4_K_M on ~96 GB of VRAM is closer to 8-16K.
Source: huggingface.co/tencent/Tencent-Hunyuan-Large
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Hunyuan Large 389B MoE runs on your specific hardware before committing money.