llama · 1B parameters · Commercial OK

Llama 3.2 1B Instruct

True edge-tier Llama. Runs on a phone or Raspberry Pi. Useful for classification, simple summarization, and on-device agents.

License: Llama 3.2 Community License · Released Sep 25, 2024 · Context: 131,072 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
6.0/10
Positioning

A 1B model exists for one job: routing or classification inside agent loops, where you need decisions in 5–10 ms on minimal hardware. As a chat model, it's clearly the bottom of the useful spectrum — fine for trivial queries, struggles with anything multi-step.
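
To make the routing job concrete, here is a minimal sketch that asks the model for a single route label through a local Ollama server. The route labels and the fallback are illustrative, not a fixed API, and a true 5–10 ms path would keep the model in-process rather than behind HTTP:

import json
import urllib.request

# Hypothetical route labels for an agent loop; adjust to your stack.
ROUTES = ["search", "summarize", "chat", "code"]

def route(query: str) -> str:
    prompt = (
        "Classify the user query into exactly one label from "
        f"{ROUTES}. Reply with the label only.\n\n"
        f"Query: {query}\nLabel:"
    )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "llama3.2:1b",
            "prompt": prompt,
            "stream": False,
            # temperature 0 and a tiny token budget: we want a label, not prose
            "options": {"temperature": 0, "num_predict": 4},
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        label = json.loads(resp.read())["response"].strip().lower()
    return label if label in ROUTES else "chat"  # fall back on junk output

print(route("condense this meeting transcript into five bullets"))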

Strengths
  • Under 1 GB at Q4_K_M — runs on a Raspberry Pi 5, on mobile devices, and on anything with at least 2 GB of free RAM.
  • Conversational tone holds up better than Phi-1.5 at a similar parameter count.
  • Same commercially usable community license as the rest of the Llama family.
Limitations
  • Multi-step reasoning fails frequently — pick a 3B+ for anything beyond a one-turn answer.
  • Hallucinates on factual questions more aggressively than expected; needs RAG or strict refusal prompting.
  • No structured-output reliability — JSON mode is unstable; validate and retry, as sketched below.
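
Given that instability, a bounded validate-and-retry loop is the pragmatic workaround. Below is a sketch using the official Ollama Python client (an assumption; any HTTP client works) plus Ollama's `format: "json"` decoding constraint:

import json

import ollama  # pip install ollama; assumes a local Ollama server

def ask_json(prompt: str, retries: int = 3) -> dict:
    # Ask for a JSON object; re-ask when the model emits malformed output.
    for _ in range(retries):
        res = ollama.generate(
            model="llama3.2:1b",
            prompt=prompt + "\nRespond with a single JSON object.",
            format="json",               # constrains decoding toward valid JSON
            options={"temperature": 0},
        )
        try:
            return json.loads(res["response"])
        except json.JSONDecodeError:
            continue                     # malformed despite the constraint
    raise ValueError("no valid JSON after retries")

The retry bound keeps failures loud instead of silently passing broken output downstream.
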
Real-world performance on RTX 4090
  • Q4_K_M (0.8 GB): 220–280 tok/s decode, TTFT under 30 ms
  • Q5_K_M (0.95 GB): 200–250 tok/s
  • Q8_0 (1.3 GB): 170–210 tok/s
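
To sanity-check decode speed on your own hardware, Ollama's non-streaming responses include timing fields (eval_count tokens generated over eval_duration nanoseconds). A quick Python check, assuming a local server with the model already pulled; run it twice and read the second result so model-load time is excluded:

import json
import urllib.request

body = json.dumps({
    "model": "llama3.2:1b",
    "prompt": "Write one sentence about the sea.",
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    r = json.loads(resp.read())
# eval_duration is reported in nanoseconds
print(f"{r['eval_count'] / (r['eval_duration'] / 1e9):.0f} tok/s decode")
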
Should you run this locally?

Yes, for routing layers in agent stacks (intent classification, query rewriting, tool selection), or for genuinely low-spec edge devices. No, for any standalone chat or assistant role.

How it compares
  • vs Llama 3.2 3B → 3B is much more capable; only pick 1B when memory or latency forces it.
  • vs Qwen 2.5 1.5B → Qwen 1.5B is meaningfully smarter at similar footprint; preferred for new edge work.
  • vs Phi-3.5 Mini (3.8B) → not the same class; Phi is for "small but capable", 1B is "tiny but functional".
Run this yourself
ollama pull llama3.2:1b-instruct-q4_K_M
ollama run llama3.2:1b-instruct-q4_K_M
Settings: Q4_K_M GGUF, 4096 ctx, llama.cpp/CUDA, RTX 4090 (or NPU/CPU)
Why this rating

6.0/10 — the smallest useful Llama. Below this size, models stop being general-purpose and become routing/classification helpers. Loses points to Qwen 2.5 1.5B, which is more capable at a near-equal footprint.


Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          0.8 GB       2 GB
Q8_0            1.3 GB       2 GB
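
The file sizes above follow from bits per weight times parameter count. A back-of-the-envelope check, assuming roughly 1.24B actual parameters and commonly cited effective rates of about 4.85 bpw for Q4_K_M and 8.5 bpw for Q8_0:

# Rough file-size estimate: parameters x bits-per-weight / 8.
# The 1.24B count and the bpw figures are approximations, not spec values.
params = 1.24e9
for name, bpw in [("Q4_K_M", 4.85), ("Q8_0", 8.5)]:
    print(f"{name}: ~{params * bpw / 8 / 1e9:.2f} GB")
# Prints roughly 0.75 GB and 1.32 GB, in line with the table above.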

Get the model

Ollama

One-line install

ollama run llama3.2:1b

Read our Ollama review →

HuggingFace

Original weights

huggingface.co/meta-llama/Llama-3.2-1B-Instruct

Source repository — you'll need to quantize the weights yourself (e.g., convert them to GGUF) before running them locally.

Hardware that runs this

Any GPU with at least 2 GB of VRAM covers at least one quantization of Llama 3.2 1B Instruct; at this size, CPU-only inference is also practical.


Frequently asked

What's the minimum VRAM to run Llama 3.2 1B Instruct?

2 GB of VRAM is enough to run Llama 3.2 1B Instruct at the Q4_K_M quantization (file size 0.8 GB). Even Q8_0, the highest-quality quantization listed, still fits in 2 GB at this model size.

Can I use Llama 3.2 1B Instruct commercially?

Yes — Llama 3.2 1B Instruct ships under the Llama 3.2 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 3.2 1B Instruct?

Llama 3.2 1B Instruct supports a context window of 131,072 tokens (i.e., 128K).

How do I install Llama 3.2 1B Instruct with Ollama?

Run `ollama pull llama3.2:1b` to download, then `ollama run llama3.2:1b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/meta-llama/Llama-3.2-1B-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.