llama · 109B parameters · Commercial OK

Llama 4 Scout

Meta's 2026 flagship MoE model. 109B total parameters with only 17B active per forward pass and a record 10-million-token context window — unmatched in production at any tier. Built for long-document workflows, RAG over entire codebases, and continuous-context agents.

License: Llama 4 Community License · Released Apr 5, 2026 · Context: 10,000,000 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
8.4/10
Positioning

Llama 4 Scout is Meta's "small flagship" of the new generation — natively multimodal, MoE architecture (109B total, 17B active), and the first Llama with a serious long-context story. It's the Llama 3.3 70B replacement for users with ~64 GB VRAM or unified memory.

Strengths
  • Native vision-language — single model handles image+text without a separate adapter, unlike Llama 3.2 11B Vision's bolted-on approach.
  • MoE active parameters (17B) keep per-token compute at 17B-class levels for flagship-tier quality — though on a single 4090 at Q4, heavy CPU offload still caps throughput at roughly 8–14 tok/s.
  • Architectural long context that genuinely holds up past 32K — recall stays competitive into 100K territory in practice.
Limitations
  • 109B total params put Q4 at ~65 GB — you need dual high-VRAM cards, an A6000-class workstation, or Apple Silicon with 96 GB+ unified memory.
  • License added new clauses vs Llama 3 — review the AUP if you ship at scale.
  • Vision quality is solid but not best-in-class — Pixtral and Qwen 2.5 VL still edge it on dense OCR and chart understanding.
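The memory math behind these limitations is simple: file size is roughly total parameters times average bits per weight. A quick sketch — the bits-per-weight values are typical llama.cpp averages (an assumption, not exact for any specific GGUF build):

```python
# Rough GGUF file-size estimate: total parameters x average bits per weight.
# The bits-per-weight values below are typical llama.cpp averages (assumed),
# not exact for any particular build.

def gguf_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate quantized file size in gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

SCOUT_TOTAL_PARAMS = 109e9  # total, not active, parameters

for quant, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("FP16", 16.0)]:
    print(f"{quant}: ~{gguf_size_gb(SCOUT_TOTAL_PARAMS, bpw):.0f} GB")
```

Note that the 17B active count is irrelevant here — every expert's weights must be resident, which is why an MoE that runs like a 17B model still loads like a 109B one.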
Real-world performance on RTX 4090
  • Q4_K_M (65 GB) — heavy offload required: 8–14 tok/s, only practical with 64 GB+ system RAM
  • Q5_K_M (78 GB) — workstation only
  • Q8_0 (~110 GB) — Mac Studio territory
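The weights are only half the budget at long context — the KV cache grows linearly with context length. A sketch of that growth, using illustrative grouped-query-attention dimensions (assumed for the example, not Scout's published config):

```python
def kv_cache_gb(ctx_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv heads x head dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

# Illustrative GQA config -- assumed for this sketch, not Scout's actual dimensions.
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (16_384, 131_072, 1_000_000):
    print(f"{ctx:>9,} tokens -> ~{kv_cache_gb(ctx, **cfg):5.1f} GB of FP16 KV cache")
```

With these assumed dimensions, 16K of context costs a few GB while a million tokens costs hundreds — which is why nobody runs anywhere near the 10M advertised window on consumer hardware.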
Should you run this locally?

Yes, for workstation rigs (dual 4090, A6000, RTX 6000 Ada) and high-RAM Mac Studios. Excellent native multimodal model. No, for single-card consumer setups — at Q4 you're CPU-offloaded; at lower quants, quality erodes faster than usual on MoE.

How it compares
  • vs Llama 3.3 70B → Scout is multimodal and has better architectural long context; Llama 3.3 70B is faster on a single 24 GB card. Pick Scout if you have the memory and want vision; otherwise stick with 3.3 70B.
  • vs Llama 4 Maverick → Maverick is the bigger sibling (400B/17B active). Same active compute but Maverick has a much larger expert pool — better quality if you can afford the disk + memory.
  • vs Qwen 2.5 VL 72B → Qwen 2.5 VL is stronger on dense visual reasoning; Scout is more usable as a general assistant. Different jobs.
Run this yourself
ollama pull llama4:scout
ollama run llama4:scout
Settings: Q4_K_M GGUF, 16384 ctx, --n-gpu-layers 30 of 49, RTX 4090 + 64 GB DDR5
Why this rating

8.4/10 — the smallest Llama 4 is the model most local users will actually run, with native multimodality and a 10M-context architecture. Loses points only because real-world recall over the full advertised context is still imperfect.

Overview

Meta's smallest Llama 4: a natively multimodal MoE (109B total / 17B active) with a 10-million-token context window, aimed at long-document workflows, whole-codebase RAG, and continuous-context agents.

Strengths

  • 10M token context (industry-leading)
  • Efficient MoE — runs at 17B-active speed
  • Strong tool/function calling

Weaknesses

  • Q4 weights alone are ~65 GB — budget 80 GB+ of VRAM once context overhead is included
  • Long-context attention is RAM-hungry
  • Newer than Llama 3.x — less ecosystem battle-testing

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         65.0 GB     80 GB
Q5_K_M         78.0 GB     95 GB
FP16           218.0 GB    240 GB

Get the model

Ollama

One-line install

ollama run llama4:scout

HuggingFace

Original weights

huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

Source repository — you'll need to quantize the weights yourself (e.g. to GGUF) for local runtimes.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Llama 4 Scout.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Step up
More capable — bigger memory footprint
No reviewed models in the next tier up yet.

Frequently asked

What's the minimum VRAM to run Llama 4 Scout?

80 GB of VRAM is enough to run Llama 4 Scout at the Q4_K_M quantization (file size 65.0 GB). Higher-quality quantizations need more.

Can I use Llama 4 Scout commercially?

Yes — Llama 4 Scout ships under the Llama 4 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 4 Scout?

Llama 4 Scout supports a context window of 10,000,000 tokens (10M).
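For a sense of scale, a back-of-envelope conversion — the 0.75 words-per-token ratio is a common heuristic for English text (an assumption, not a measured figure):

```python
def tokens_to_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Rough English word count for a token budget (heuristic ratio, assumed)."""
    return int(tokens * words_per_token)

context = 10_000_000
words = tokens_to_words(context)
print(f"{context:,} tokens is roughly {words:,} English words")
print(f"-- on the order of {words // 90_000} average-length novels")
```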

How do I install Llama 4 Scout with Ollama?

Run `ollama pull llama4:scout` to download, then `ollama run llama4:scout` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.