hunyuan · 389B parameters · Commercial OK · Reviewed May 2026

Hunyuan Large 389B MoE

Tencent's frontier MoE. 389B total / 52B active. License permits commercial use with restrictions on companies above MAU thresholds.

License: Tencent Hunyuan License · Released Nov 5, 2024 · Context: 256,000 tokens

How to run it

Hunyuan-Large is Tencent's 389B-parameter MoE model (~52B active per token). Run it at Q4_K_M via llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is ~220 GB on disk (all 389B parameters must be stored), but only ~52B are computed per token. Fully VRAM-resident inference needs ~260 GB (see the quantization table below); with llama.cpp's expert offload to system RAM, ~96 GB of VRAM is workable: dual RTX A6000 (48 GB each) with tensor-split, or a single A100 80GB with more aggressive offload. For GPU-poor setups, CPU-only inference at Q4_K_M is viable on a server with 256+ GB RAM, at roughly 3-6 tok/s on a high-core-count Xeon/EPYC. MoE experts are activated per token; llama.cpp parks expert weights in system RAM when VRAM is tight, at the cost of speed whenever routing hits a RAM-resident expert. For serving: vLLM on 2-4× A100 with tensor-parallel=2 (or 4) on AWQ-INT4 weights. Context: 256K advertised; the practical usable range at Q4_K_M on 96 GB is ~8-16K.
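A minimal single-node invocation matching those settings (the GGUF filename is illustrative, and flag spellings shift between llama.cpp releases, so verify against llama-cli --help):

  # Q4_K_M across two GPUs: full offload, flash attention, 8K context
  llama-cli -m hunyuan-large-q4_k_m.gguf \
    -ngl 999 -fa -c 8192 \
    --tensor-split 1,1 \
    -p "Summarize the tradeoffs of MoE inference in three sentences."

On a single 80 GB card, drop -ngl below 999 or use the expert-offload override shown under "What breaks first" so the shared layers stay resident.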

Hardware guidance

Minimum: A100 80GB at Q4_K_M with expert offload to system RAM (~220 GB on disk; only the ~30 GB active subset plus shared layers and KV cache need to sit in VRAM). Recommended: dual RTX A6000 (96 GB total) at Q4_K_M with row-split and 8-16K context. Budget path: CPU-only on a 256 GB RAM server at Q4_K_M (3-6 tok/s). VRAM math: 389B total, ~52B active; Q4_K_M for the active subset ≈ 30 GB. Inactive expert weights sit in VRAM or system RAM depending on the offload strategy; llama.cpp with KV offload and expert offload to RAM cuts the VRAM requirement but adds latency on expert switches. RTX 4090 24GB: Q3_K_M with aggressive expert offload to RAM. Mac Studio M4 Ultra 128GB: the ~220 GB Q4_K_M file exceeds unified memory, so only a very low-bit quant (~Q2) comes close to fitting, at ~4-8 tok/s. Cloud: 2× A100 at ~$16-30/hr.
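The arithmetic behind those figures, assuming Q4_K_M averages roughly 4.85 bits per weight (actual GGUF sizes vary with the tensor mix):

  full model:     389e9 weights × 4.85 / 8 ≈ 236 GB   (in line with the ~220 GB table figure)
  active subset:   52e9 weights × 4.85 / 8 ≈ 31.5 GB  (the "≈ 30 GB" quoted above)

A 96 GB dual-A6000 rig therefore holds the active path plus a large share of hot experts in VRAM, with the remainder paged from system RAM.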

What breaks first

  1. Expert routing stall. When experts are offloaded to system RAM, a routing decision that hits a RAM-resident expert adds 50-200ms of latency. At low batch sizes this shows up as visible stutter during generation. Keep as many experts in VRAM as possible (see the offload sketch after this list).
  2. Chinese-language bias. Hunyuan-Large is Tencent's model, and its training data is Chinese-heavy. English quality is competitive but may show Chinese-culture bias on nuanced prompts.
  3. AWQ on MoE. AWQ-INT4 quantization can destabilize expert routing on MoE architectures more than it degrades dense models. Test routing correctness at Q4 before deploying.
  4. Tensor-split imbalance. llama.cpp row-split across mismatched GPUs (e.g., A6000 + RTX 3090) leaves the faster GPU idling while it waits for the slower one. Use identical GPU pairs.
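One way to keep the shared layers in VRAM while parking expert tensors in system RAM is llama.cpp's tensor-placement override. A sketch, assuming a recent build that ships the -ot/--override-tensor flag; the regex targets GGUF expert tensor names and may need adjusting to this model's naming:

  # Offload only the MoE expert FFN tensors to CPU RAM; everything else stays on GPU
  llama-cli -m hunyuan-large-q4_k_m.gguf -ngl 999 -fa -c 8192 \
    -ot ".ffn_.*_exps.=CPU"

Newer builds also ship an --n-cpu-moe N convenience flag that keeps the expert tensors of the first N layers on CPU; check your build's --help for which is available.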

Runtime recommendation

llama.cpp with -ngl 999 and expert offload tuning for single-node. vLLM for multi-user serving with tensor-parallel=2 on A100. SGLang if vLLM MoE routing is unstable. Avoid Ollama — MoE expert offload isn't exposed in Ollama's config surface and default settings may cause OOM.
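A representative vLLM launch for the serving path (the checkpoint id is hypothetical: you would point this at an AWQ-quantized build, not the original-weights repo, and flags vary by vLLM version):

  # Serve an INT4-AWQ build across two A100s
  vllm serve your-org/Hunyuan-Large-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 16384 \
    --trust-remote-code

--max-model-len caps the context so the KV cache stays inside the VRAM budget; raise it only if the math in the hardware section leaves headroom.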

Common beginner mistakes

  • Mistake: assuming "52B active" means it fits in 32 GB of VRAM. Fix: all 389B parameters must be accessible (disk/RAM/VRAM); 52B is what's computed per token, not what's stored. Budget ~220 GB of storage for Q4.
  • Mistake: expecting consistent generation speed. Fix: expert routing means some tokens hit VRAM-resident experts (fast) and some hit RAM-resident experts (a 50-200ms stall). Speed varies per token.
  • Mistake: using Q8 for the full MoE. Fix: Q8 for 389B is ~350 GB and needs 4-8× A100; start at Q4_K_M (see the arithmetic below).
  • Mistake: ignoring license restrictions. Fix: Tencent's license restricts commercial use for companies above MAU thresholds. Verify at huggingface.co/tencent/Tencent-Hunyuan-Large before production deployment.
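The Q8 warning is plain arithmetic: Q8 stores roughly one byte per weight (Q8_0 adds a small per-block scale on top), and the full expert set must be stored regardless of what is active:

  389e9 weights × ~1 byte/weight ≈ 390 GB (≈ 360 GiB) of weights alone

Add KV cache and runtime overhead and you are firmly in multi-A100 territory, which is why Q4_K_M is the sane starting point.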

Strengths

  • Open-weight frontier MoE
  • Strong on Chinese + English

Weaknesses

  • Server-class footprint required (multi-GPU or a 256 GB+ RAM box)
  • Tier-restricted commercial license

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         220.0 GB    260 GB

Get the model

HuggingFace

Original weights

huggingface.co/tencent/Tencent-Hunyuan-Large

Source repository: original weights only, so you must quantize them yourself.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Hunyuan Large 389B MoE.

NVIDIA GB200 NVL72
13824 GB · NVIDIA
AMD Instinct MI355X
288 GB · AMD

Frequently asked

What's the minimum VRAM to run Hunyuan Large 389B MoE?

260 GB of VRAM runs Hunyuan Large 389B MoE fully resident at the Q4_K_M quantization (file size 220.0 GB); with expert offload to system RAM, ~96 GB of VRAM is workable at reduced speed. Higher-quality quantizations need more.

Can I use Hunyuan Large 389B MoE commercially?

Yes, with restrictions. Hunyuan Large 389B MoE ships under the Tencent Hunyuan License, which permits commercial use but imposes limits on companies above MAU thresholds. Always read the license text before deployment.

What's the context length of Hunyuan Large 389B MoE?

Hunyuan Large 389B MoE supports a context window of 256,000 tokens (256K).

Source: huggingface.co/tencent/Tencent-Hunyuan-Large

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Compare alternatives

Same parameter band, plus one tier above and below, so you can decide what actually fits your hardware.

Same tier
Models in the same parameter band as this one
  • DeepSeek V4 Pro (1.6T MoE)
    deepseek · 1600B
    unrated
  • Qwen 3.5 235B-A17B (MoE)
    qwen · 397B
    unrated
  • Qwen 3 235B-A22B
    qwen · 235B
    unrated
  • DeepSeek V4 Flash (284B MoE)
    deepseek · 284B
    unrated
Step up
More capable — bigger memory footprint
No verdicted models in the next tier up yet.
Step down
Smaller — faster, runs on weaker hardware
  • Llama 3.3 70B Instruct
    llama · 70B
    9.1/10
  • DeepSeek R1 Distill Llama 70B
    deepseek · 70B
    9.0/10