Llama · 400B parameters · Commercial OK · Multimodal

Llama 4 Maverick

Meta's high-end Llama 4 sibling — a 128-expert MoE built for performance over efficiency. Multilingual strength is its standout. Effectively a server-tier model; consumer hardware can't load it without aggressive quantization and offloading.

License: Llama 4 Community License · Released Apr 5, 2025 · Context: 1,000,000 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
8.7/10
Positioning

Llama 4 Maverick is the model you run when you have a Mac Studio M2/M3 Ultra with 192+ GB unified memory, a workstation with 80+ GB VRAM across dual cards, or an H100. Same active-parameter footprint as Scout (~17B per token) but a much larger expert pool — quality lifts noticeably on hard tasks.
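
To make the sizing concrete, here is a rough back-of-envelope sketch in Python (illustrative estimates only, assuming Q4_K_M averages about 4.5 bits per weight):
# Capacity math for a 400B-total / 17B-active MoE at Q4-class quantization.
# Every expert must be resident (or paged), so the TOTAL parameter count sets
# the memory bill even though only ~17B parameters fire per token.
TOTAL_PARAMS = 400e9
BITS_PER_WEIGHT = 4.5                               # Q4_K_M average, assumed
weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"Quantized weights: ~{weights_gb:.0f} GB")   # ~225 GB, before KV cache and runtime overhead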

Strengths
  • Frontier-adjacent quality for an open-weight model — closes most of the remaining gap with closed models on the GPT-4-class workload mix.
  • MoE compute story remains favorable — only 17B active per token means 8–15 tok/s on properly resourced hardware despite the 400B nameplate (see the quick arithmetic after this list).
  • Native multimodal like Scout, but the larger expert pool gives better dense reasoning on charts, tables, and code-with-screenshot workflows.
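
A companion sketch of why the active-parameter count, not the 400B nameplate, sets decode speed (same assumed ~4.5 bits per weight, rough numbers only):
# Per-token weight traffic is governed by the ~17B active parameters.
ACTIVE_PARAMS = 17e9
BYTES_PER_WEIGHT = 4.5 / 8                          # Q4_K_M average, assumed
gb_per_token = ACTIVE_PARAMS * BYTES_PER_WEIGHT / 1e9
print(f"~{gb_per_token:.1f} GB of expert weights read per generated token")
# At ~800 GB/s of unified memory bandwidth the weight reads alone would allow
# tens of tok/s; routing overhead, attention, and KV-cache traffic pull
# real-world throughput down toward the 8–15 tok/s quoted above.
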
Limitations
  • 400B total parameters — disk footprint at Q4 is ~225 GB, working set similar. This is "do you own a workstation" hardware.
  • MoE quality at very low quants drops faster than dense models — Q3 and below show degraded routing decisions; Q4 minimum.
  • License audit recommended before commercial deployment given Llama 4's revised AUP.
Real-world performance on RTX 4090
  • Q4_K_M (~225 GB) — not realistically runnable on 4090 even with offload; system RAM bandwidth becomes the bottleneck
  • Q3_K_M (~165 GB) — possible on dual 4090 + 192 GB DDR5, ~3–5 tok/s; not recommended (quality cliff); see the offload sketch after this list
  • Comfortable on: Mac Studio M2/M3 Ultra 192 GB or 4×A100 80 GB
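
For the dual-4090-plus-system-RAM scenario, a minimal partial-offload sketch using llama-cpp-python; the GGUF file name and layer count are placeholders, and it assumes a llama.cpp build with Llama 4 support:
from llama_cpp import Llama

llm = Llama(
    model_path="llama-4-maverick-Q3_K_M-00001-of-00004.gguf",  # placeholder split-GGUF name
    n_gpu_layers=20,    # push only as many layers as the two cards can actually hold
    n_ctx=16384,
    n_threads=16,       # CPU threads serve the layers left in system RAM
)
out = llm("Summarize the trade-offs of partial MoE offload in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
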
Should you run this locally?

Yes, for owners of M-series Ultra Macs (the unified memory makes this model uniquely accessible to Mac users) and workstation rigs with 80+ GB VRAM. No, for anyone on consumer GPUs — the model is genuinely workstation-class and partial offload onto consumer DDR5 is too slow to be productive.

How it compares
  • vs Llama 4 Scout → Maverick is materially smarter on hard reasoning + dense visual tasks; Scout fits on far more attainable hardware. Choose by what you can afford to feed.
  • vs Llama 3.3 70B → Maverick wins on quality, multimodality, and long context; Llama 3.3 70B wins on practicality (runs on a single 24 GB card).
  • vs Qwen 3 235B-A22B → Qwen 3 235B-A22B is the closest open-weight peer at scale, with similar MoE structure but smaller total params (235B vs 400B). Qwen edges on multilingual; Llama edges on tool use + ecosystem.
Run this yourself
# Mac Studio M2/M3 Ultra example
ollama pull llama4:maverick
ollama run llama4:maverick
Settings: Q4_K_M GGUF, 16384 ctx, MLX or Metal backend, M2 Ultra 192 GB
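
If you would rather drive it from a script than the CLI, here is a minimal sketch with the ollama Python client; it assumes the llama4:maverick tag pulled above and a local Ollama server:
import ollama

response = ollama.chat(
    model="llama4:maverick",
    messages=[{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}],
    options={"num_ctx": 16384},    # matches the 16384-ctx setting listed above
)
print(response["message"]["content"])
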
Why this rating

8.7/10 — the real Llama 4 flagship for serious local deployment. The 400B-total / 17B-active design wins on quality vs Scout while running at roughly the same speed; the only real question is whether you have the disk and memory.

Overview

Meta's high-end Llama 4 sibling — a 128-expert MoE built for performance over efficiency. Multilingual strength is its standout. Effectively a server-tier model; consumer hardware can't load it without aggressive quantization and offloading.

Strengths

  • 128-expert MoE for top quality
  • Strong multilingual coverage
  • Best-in-class for Meta family

Weaknesses

  • Server-tier only on consumer hardware
  • Slower per token than Scout despite the same active params
  • Heavy disk footprint

Quantization variants

Each quantization trades model quality against file size and VRAM requirements. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M       | 240.0 GB  | 280 GB

Get the model

Ollama

One-line install

ollama run llama4:maverick

HuggingFace

Original weights

huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct

Source repository — ships unquantized weights, so you'll need to quantize them yourself.
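
A minimal download sketch with huggingface_hub; it assumes you have accepted the Llama 4 license on the repo page and are logged in via the Hugging Face CLI. Quantization is a separate step afterwards:
from huggingface_hub import snapshot_download

# Pull the original (unquantized) weights; expect several hundred GB on disk.
local_dir = snapshot_download(repo_id="meta-llama/Llama-4-Maverick-17B-128E-Instruct")
print(f"Weights downloaded to {local_dir}")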

Hardware that runs this

Cards with enough VRAM for at least one quantization of Llama 4 Maverick.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Step up
More capable — bigger memory footprint
No models with a verdict in the next tier up yet.

Frequently asked

What's the minimum VRAM to run Llama 4 Maverick?

280 GB of VRAM is enough to run Llama 4 Maverick at the Q4_K_M quantization (file size 240.0 GB). Higher-quality quantizations need more.

Can I use Llama 4 Maverick commercially?

Yes — Llama 4 Maverick ships under the Llama 4 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 4 Maverick?

Llama 4 Maverick supports a context window of 1,000,000 tokens (1M).

How do I install Llama 4 Maverick with Ollama?

Run `ollama pull llama4:maverick` to download, then `ollama run llama4:maverick` to start a chat session. The default quantization is Q4_K_M.

Does Llama 4 Maverick support images?

Yes — Llama 4 Maverick is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.
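
A hedged sketch of passing an image through the ollama Python client; the image path is a placeholder, and it assumes an Ollama build with Llama 4 vision support:
import ollama

response = ollama.chat(
    model="llama4:maverick",
    messages=[{
        "role": "user",
        "content": "What does this chart show?",
        "images": ["chart.png"],   # placeholder path to a local image
    }],
)
print(response["message"]["content"])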

Source: huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.