Llama 4 405B
Overview
Meta's dense flagship in the Llama 4 line. 405B params; comparable footprint to Llama 3.1 405B with the Llama 4 reasoning improvements.
How to run it
Llama 4 405B is Meta's largest dense model: 405B parameters, 231 GB on disk at Q4_K_M, ~405 GB at FP8, and roughly twice that (~810 GB) at FP16. There is no single-GPU path. The smallest configuration that loads is 4× A100 80GB at Q4_K_M (320 GB pool) for batch=1 at 4K context; the recommended setup is 8× H100 SXM at FP8 with vLLM tensor-parallel=8.

Expected throughput: ~8-15 tok/s per user at batch=1 on 8× H100 at FP8, and ~5-10 tok/s at batch=1 for Q4_K_M on 4× A100. The KV cache adds roughly 25-35 GB at 8K context.

Llama 4 405B uses the standard LLaMA architecture, so ecosystem support is broad. llama.cpp splits it across GPUs via --split-mode row and --tensor-split (use CUDA_VISIBLE_DEVICES to select devices). The Ollama default tag should be Q4_K_M; verify before pulling. A Mac Studio M4 Ultra with 192 GB can theoretically load Q2_K (~116 GB) at 2-4 tok/s, but it is not recommended for interactive use. Cloud rental: 4× H100 at ~$25-40/hr.
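For the recommended 8× H100 FP8 path, here is a minimal sketch using vLLM's offline Python API. The repo id is a placeholder (substitute the actual Hugging Face repo), and the quantization and context settings simply mirror the numbers above rather than vendor-blessed defaults.

```python
# Minimal vLLM sketch for the recommended 8x H100 SXM + FP8 setup.
# The model id is a placeholder; substitute the actual Hugging Face repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-405B",  # placeholder repo id
    tensor_parallel_size=8,           # one shard per H100
    quantization="fp8",               # FP8 weights (~405 GB spread across the pool)
    max_model_len=8192,               # keeps KV cache within the ~25-35 GB budget
)

outputs = llm.generate(
    ["Explain tensor parallelism in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```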
Hardware guidance
Minimum: 4× A100 80GB at Q4_K_M (231 GB weights + ~12 GB KV at 4K = 243 GB, which fits the 320 GB pool). Recommended: 8× H100 SXM at FP8 with NVLink for tensor-parallel communication. VRAM math: dense 405B at Q4_K_M is ~0.57 bytes/param, or ~231 GB of weights. The KV cache runs roughly 3 MB per token across the 128-layer stack, so an 8K context adds ~25 GB. Total minimum VRAM: ~256 GB for Q4 at 8K, batch=1. 4× A100 80GB = 320 GB is comfortable with headroom. 4× RTX A6000 48GB = 192 GB is insufficient for Q4_K_M; you would have to drop to Q2_K (~116 GB) with severe quality loss. Mac Studio M4 Ultra 192 GB works at Q2_K only. There is no single-consumer-GPU option.
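The arithmetic above can be sanity-checked in a few lines of Python. The bytes-per-parameter and per-token KV figures are this guide's estimates, not measured values.

```python
# Back-of-envelope VRAM estimate using the guide's figures.
PARAMS = 405e9
Q4_K_M_BYTES_PER_PARAM = 0.57     # guide's estimate for Q4_K_M
KV_BYTES_PER_TOKEN = 3e6          # ~3 MB/token across all 128 layers (guide's estimate)

def vram_gb(context_len: int) -> float:
    weights = PARAMS * Q4_K_M_BYTES_PER_PARAM
    kv_cache = context_len * KV_BYTES_PER_TOKEN
    return (weights + kv_cache) / 1e9

print(f"4K context: {vram_gb(4096):.0f} GB")   # ~243 GB, fits a 320 GB pool
print(f"8K context: {vram_gb(8192):.0f} GB")   # ~255 GB, matching the ~256 GB minimum above
```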
What breaks first
1. Cross-GPU communication bottleneck. On non-NVLink (PCIe-only) setups, tensor-parallel bandwidth becomes the bottleneck and MFU drops to 15-25% on 4× A100 without NVLink. Use NVLink-bridged pairs whenever possible.
2. First-token latency. A dense 405B with tensor parallelism incurs 5-15 seconds of time-to-first-token at 4K context on 4× A100. Not suitable for latency-sensitive applications without speculative decoding.
3. Q2 quality cliff. Q2_K quantization on 405B is viable for loading, but quality degrades significantly on factual accuracy and complex reasoning. Benchmark your task before committing to Q2.
4. Ollama default tag may use insufficient context. Verify Ollama's default context length for Llama 4 405B; some tags default to 2048. Override it with /set parameter (see the sketch after this list).
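For the last point, the override can also be set programmatically. Below is a sketch using the ollama Python client; the model tag is a placeholder (check `ollama list` for the real one), and the interactive equivalent inside `ollama run` is `/set parameter num_ctx 8192`.

```python
# Sketch: raising Ollama's default context window from the Python client.
# The model tag is a placeholder; check `ollama list` for the actual tag.
import ollama

response = ollama.chat(
    model="llama4:405b",                      # placeholder tag
    messages=[{"role": "user", "content": "Summarize the tradeoffs of Q2_K."}],
    options={"num_ctx": 8192},                # override the 2048 default
)
print(response["message"]["content"])
```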
Runtime recommendation
vLLM with tensor-parallel=8 on H100 SXM for serving. llama.cpp with -ngl 999 --tensor-split for multi-GPU local use. SGLang is an alternative if vLLM's memory management causes OOM at long context. Avoid Ollama for multi-GPU: it delegates to llama.cpp but obscures the tensor-split config. Avoid MLX-LM: Apple Silicon is not viable at this scale.
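If you prefer to drive llama.cpp from Python rather than the raw server, the llama-cpp-python binding exposes the same split controls. A sketch assuming a local Q4_K_M GGUF on a 4× A100 box follows; the file path is a placeholder.

```python
# Sketch: multi-GPU llama.cpp via the llama-cpp-python binding, mirroring
# the -ngl 999 --tensor-split advice above. The GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./Llama-4-405B.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,                          # offload every layer (same intent as -ngl 999)
    tensor_split=[0.25, 0.25, 0.25, 0.25],    # even split across 4x A100
    n_ctx=8192,
)

result = llm("List three uses of tensor parallelism.", max_tokens=64)
print(result["choices"][0]["text"])
```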
Common beginner mistakes
- Mistake: thinking dual RTX 4090 (48 GB) can run 405B. Fix: Q4 is 231 GB and even Q2_K is ~116 GB. Do the math: 48 GB is roughly 5× too small (a fit-check helper is sketched after this list).
- Mistake: running at 128K context on Q4_K_M with 4× A100. Fix: the KV cache at 128K is 300+ GB on its own. 4K is the realistic starting point; 8K with headroom.
- Mistake: using Ollama without checking --tensor-split. Fix: llama.cpp row-split requires explicit GPU assignment, which Ollama obscures. Use the raw llama.cpp server for multi-GPU.
- Mistake: expecting a fast first token on 405B. Fix: time-to-first-token at 4K context is 5-15 seconds on 4× A100. Speculative decoding with a 7B draft model cuts this significantly.
- Mistake: renting GPUs without NVLink and expecting high throughput. Fix: without NVLink, MFU drops below 25%. Rent NVLink-bridged instances (A100 SXM, H100 SXM) if throughput matters.
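To "do the math" from the first mistake, here is a small helper that checks whether a GPU pool fits a given quant plus KV cache. The sizes are this guide's estimates, not measured values.

```python
# Quick fit check: does a GPU pool hold the weights plus KV cache?
# Weight sizes and the 3 MB/token KV figure are the guide's estimates.
QUANT_WEIGHTS_GB = {"Q4_K_M": 231, "Q2_K": 116, "FP8": 405}

def fits(quant: str, num_gpus: int, gb_per_gpu: int, context: int = 8192) -> bool:
    kv_gb = context * 3e6 / 1e9          # ~3 MB/token across the full stack
    return num_gpus * gb_per_gpu >= QUANT_WEIGHTS_GB[quant] + kv_gb

print(fits("Q4_K_M", 2, 24))   # dual RTX 4090: False
print(fits("Q4_K_M", 4, 80))   # 4x A100 80GB: True
```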
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent/child edges record direct distillation or fine-tune relationships.
Strengths
- Frontier-tier reasoning
- Strong multilingual
Weaknesses
- Multi-node cluster only
- Llama Community License usage restrictions for very large companies
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| AWQ-INT4 | 230.0 GB | 280 GB |
Get the model
HuggingFace
Original weights
Source repository with the original full-precision weights; you will need to quantize them yourself.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Llama 4 405B.
Frequently asked
What's the minimum VRAM to run Llama 4 405B?
About 256 GB total for Q4_K_M at 8K context, batch=1; in practice that means a 4× A100 80GB pool (320 GB) or larger.
Can I use Llama 4 405B commercially?
Yes, under the Llama Community License, which imposes usage restrictions on very large companies.
What's the context length of Llama 4 405B?
Source: huggingface.co/meta-llama/Llama-4-405B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Verify Llama 4 405B runs on your specific hardware before committing money.