hunyuan · 389B parameters · Commercial OK · Reviewed May 2026

Hunyuan Large 389B MoE

Tencent's frontier MoE. 389B total / 52B active. License permits commercial use with restrictions on companies above MAU thresholds.

License: Tencent Hunyuan License · Released Nov 5, 2024 · Context: 256,000 tokens

How to run it

Hunyuan-Large is Tencent's 389B-parameter MoE model (~52B active per token). Run it at Q4_K_M via llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is ~220 GB on disk (all 389B parameters must be stored), but only ~52B are computed per token. Fully VRAM-resident inference needs ~260 GB (see the quantization table below); with llama.cpp's expert offload to system RAM, ~96 GB of VRAM is workable: dual RTX A6000 (48 GB each) with tensor-split, or a single A100 80GB with more aggressive offload. For GPU-poor setups, CPU-only inference at Q4_K_M is viable on a server with 256+ GB RAM, at roughly 3-6 tok/s on a high-core-count Xeon/EPYC. MoE experts are activated per token; llama.cpp parks expert weights in system RAM when VRAM is tight, at the cost of speed whenever routing hits a RAM-resident expert. For serving: vLLM on 2-4× A100 with tensor-parallel=2 (or 4) on AWQ-INT4 weights. Context: 256K advertised; the practical usable range at Q4_K_M on 96 GB is ~8-16K.
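A minimal single-node invocation matching those settings (the GGUF filename is illustrative, and flag spellings shift between llama.cpp releases, so verify against llama-cli --help):

  # Q4_K_M across two GPUs: full offload, flash attention, 8K context
  llama-cli -m hunyuan-large-q4_k_m.gguf \
    -ngl 999 -fa -c 8192 \
    --tensor-split 1,1 \
    -p "Summarize the tradeoffs of MoE inference in three sentences."

On a single 80 GB card, drop -ngl below 999 or use the expert-offload override shown under "What breaks first" so the shared layers stay resident.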

Hardware guidance

Minimum: A100 80GB at Q4_K_M with expert offload to system RAM (~220 GB on disk; only the ~30 GB active subset plus shared layers and KV cache need to sit in VRAM). Recommended: dual RTX A6000 (96 GB total) at Q4_K_M with row-split and 8-16K context. Budget path: CPU-only on a 256 GB RAM server at Q4_K_M (3-6 tok/s). VRAM math: 389B total, ~52B active; Q4_K_M for the active subset ≈ 30 GB. Inactive expert weights sit in VRAM or system RAM depending on the offload strategy; llama.cpp with KV offload and expert offload to RAM cuts the VRAM requirement but adds latency on expert switches. RTX 4090 24GB: Q3_K_M with aggressive expert offload to RAM. Mac Studio M4 Ultra 128GB: the ~220 GB Q4_K_M file exceeds unified memory, so only a very low-bit quant (~Q2) comes close to fitting, at ~4-8 tok/s. Cloud: 2× A100 at ~$16-30/hr.
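The arithmetic behind those figures, assuming Q4_K_M averages roughly 4.85 bits per weight (actual GGUF sizes vary with the tensor mix):

  full model:     389e9 weights × 4.85 / 8 ≈ 236 GB   (in line with the ~220 GB table figure)
  active subset:   52e9 weights × 4.85 / 8 ≈ 31.5 GB  (the "≈ 30 GB" quoted above)

A 96 GB dual-A6000 rig therefore holds the active path plus a large share of hot experts in VRAM, with the remainder paged from system RAM.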

What breaks first

  1. Expert routing stall. When experts are offloaded to system RAM, a routing decision that hits a RAM-resident expert adds 50-200ms of latency. At low batch sizes this shows up as visible stutter during generation. Keep as many experts in VRAM as possible (see the offload sketch after this list).
  2. Chinese-language bias. Hunyuan-Large is Tencent's model, and its training data is Chinese-heavy. English quality is competitive but may show Chinese-culture bias on nuanced prompts.
  3. AWQ on MoE. AWQ-INT4 quantization can destabilize expert routing on MoE architectures more than it degrades dense models. Test routing correctness at Q4 before deploying.
  4. Tensor-split imbalance. llama.cpp row-split across mismatched GPUs (e.g., A6000 + RTX 3090) leaves the faster GPU idling while it waits for the slower one. Use identical GPU pairs.
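One way to keep the shared layers in VRAM while parking expert tensors in system RAM is llama.cpp's tensor-placement override. A sketch, assuming a recent build that ships the -ot/--override-tensor flag; the regex targets GGUF expert tensor names and may need adjusting to this model's naming:

  # Offload only the MoE expert FFN tensors to CPU RAM; everything else stays on GPU
  llama-cli -m hunyuan-large-q4_k_m.gguf -ngl 999 -fa -c 8192 \
    -ot ".ffn_.*_exps.=CPU"

Newer builds also ship an --n-cpu-moe N convenience flag that keeps the expert tensors of the first N layers on CPU; check your build's --help for which is available.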

Runtime recommendation

llama.cpp with -ngl 999 and expert offload tuning for single-node. vLLM for multi-user serving with tensor-parallel=2 on A100. SGLang if vLLM MoE routing is unstable. Avoid Ollama — MoE expert offload isn't exposed in Ollama's config surface and default settings may cause OOM.
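A representative vLLM launch for the serving path (the checkpoint id is hypothetical: you would point this at an AWQ-quantized build, not the original-weights repo, and flags vary by vLLM version):

  # Serve an INT4-AWQ build across two A100s
  vllm serve your-org/Hunyuan-Large-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 16384 \
    --trust-remote-code

--max-model-len caps the context so the KV cache stays inside the VRAM budget; raise it only if the math in the hardware section leaves headroom.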

Common beginner mistakes

  • Mistake: assuming "52B active" means it fits in 32 GB of VRAM. Fix: all 389B parameters must be accessible (disk/RAM/VRAM); 52B is what's computed per token, not what's stored. Budget ~220 GB of storage for Q4.
  • Mistake: expecting consistent generation speed. Fix: expert routing means some tokens hit VRAM-resident experts (fast) and some hit RAM-resident experts (a 50-200ms stall). Speed varies per token.
  • Mistake: using Q8 for the full MoE. Fix: Q8 for 389B is ~350 GB and needs 4-8× A100; start at Q4_K_M (see the arithmetic below).
  • Mistake: ignoring license restrictions. Fix: Tencent's license restricts commercial use for companies above MAU thresholds. Verify at huggingface.co/tencent/Tencent-Hunyuan-Large before production deployment.
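The Q8 warning is plain arithmetic: Q8 stores roughly one byte per weight (Q8_0 adds a small per-block scale on top), and the full expert set must be stored regardless of what is active:

  389e9 weights × ~1 byte/weight ≈ 390 GB (≈ 360 GiB) of weights alone

Add KV cache and runtime overhead and you are firmly in multi-A100 territory, which is why Q4_K_M is the sane starting point.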

Strengths

  • Open-weight frontier MoE
  • Strong on Chinese + English

Weaknesses

  • Server-class footprint required (multi-GPU or a 256 GB+ RAM box)
  • Tier-restricted commercial license

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         220.0 GB    260 GB

Get the model

HuggingFace

Original weights

huggingface.co/tencent/Tencent-Hunyuan-Large

Source repository: original weights only, so you must quantize them yourself.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Hunyuan Large 389B MoE.

NVIDIA GB200 NVL72
13824 GB · NVIDIA
AMD Instinct MI355X
288 GB · AMD

Frequently asked

What's the minimum VRAM to run Hunyuan Large 389B MoE?

260 GB of VRAM runs Hunyuan Large 389B MoE fully resident at the Q4_K_M quantization (file size 220.0 GB); with expert offload to system RAM, ~96 GB of VRAM is workable at reduced speed. Higher-quality quantizations need more.

Can I use Hunyuan Large 389B MoE commercially?

Yes, with restrictions. Hunyuan Large 389B MoE ships under the Tencent Hunyuan License, which permits commercial use but imposes limits on companies above MAU thresholds. Always read the license text before deployment.

What's the context length of Hunyuan Large 389B MoE?

Hunyuan Large 389B MoE supports a context window of 256,000 tokens (256K).

Source: huggingface.co/tencent/Tencent-Hunyuan-Large

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Compare alternatives

Same parameter band, plus one tier above and below, so you can decide what actually fits your hardware.

Same tier
Models in the same parameter band as this one
  • DeepSeek V4 Pro (1.6T MoE)
    deepseek · 1600B
    unrated
  • Qwen 3.5 235B-A17B (MoE)
    qwen · 397B
    unrated
  • Qwen 3 235B-A22B
    qwen · 235B
    unrated
  • DeepSeek V4 Flash (284B MoE)
    deepseek · 284B
    unrated
Step up
More capable — bigger memory footprint
No verdicted models in the next tier up yet.
Step down
Smaller — faster, runs on weaker hardware
  • Llama 3.3 70B Instruct
    llama · 70B
    9.1/10
  • DeepSeek R1 Distill Llama 70B
    deepseek · 70B
    9.0/10