Mistral Nemo 12B Instruct
Joint Mistral/NVIDIA release with native 128K context and a new Tekken tokenizer. Strong multilingual; popular fine-tune base.
An NVIDIA-Mistral collaboration that targets the 12B class with European multilingual strength and a 128K context. The right pick when you specifically need Apache license + 12B-class capability + non-English performance.
Strengths
- Apache 2.0 license — cleanest in this size tier.
- Strong European multilingual — French, German, Spanish, Italian, Portuguese are all near-native quality.
- 128K context with reasonable recall — better than Llama 3.1 8B at the same advertised window.
Weaknesses
- Quality lags Qwen 2.5 14B for similar VRAM at Q4.
- Knowledge breadth is narrower than Llama 3.1 8B on English long-tail facts.
- No thinking-mode option — straight dense model.
Performance
- Q4_K_M (7.5 GB): 78–95 tok/s decode, TTFT ~95 ms
- Q5_K_M (8.7 GB): 68–82 tok/s
- Q8_0 (13.0 GB): 50–62 tok/s
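The decode rates above translate directly into wall-clock time: total ≈ TTFT + tokens ÷ decode rate. A back-of-envelope sketch using the Q4_K_M figures (the 86.5 tok/s midpoint of the quoted 78–95 range is an assumption for illustration; real throughput varies with prompt length and sampler settings):

```python
# Rough generation-time estimate from the throughput figures above.
# ttft_s comes from the quoted ~95 ms time-to-first-token; the decode
# rate is the midpoint of the 78-95 tok/s Q4_K_M range (an assumption).
ttft_s = 0.095
decode_tok_per_s = 86.5

def gen_time_s(n_tokens: int) -> float:
    """Estimated seconds to stream n_tokens of output."""
    return ttft_s + n_tokens / decode_tok_per_s

print(f"{gen_time_s(500):.1f} s")  # ~5.9 s for a 500-token answer
```

The same arithmetic says a full 2,000-token response lands in roughly 23 seconds on this hardware, which is the practical difference between Q4_K_M and Q8_0 for interactive use.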
Should you run it?
Yes, for Apache-licensed European multilingual workloads, or as a strict upgrade from Mistral 7B v0.3 in existing pipelines. No, for English-only tasks where Qwen 2.5 14B or Llama 3.1 8B are stronger.
How it compares
- vs Mistral 7B v0.3 → Nemo replaces 7B v0.3 in the modern Mistral lineup; same Apache license, materially stronger.
- vs Llama 3.1 8B → close call. Llama wins on English instruction polish; Nemo wins on multilingual + license simplicity.
- vs Qwen 2.5 14B → Qwen 2.5 14B is stronger absolute capability; Nemo has cleaner license.
- vs Pixtral 12B → Pixtral is the multimodal sibling; pick Pixtral if you need vision, Nemo if text-only.
```shell
ollama pull mistral-nemo:12b-instruct-q4_K_M
ollama run mistral-nemo:12b-instruct-q4_K_M
```
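Once pulled, the model can also be queried programmatically: a running Ollama daemon serves an HTTP API on http://localhost:11434 by default. A minimal sketch of building a request body for its /api/generate endpoint (the helper name and example prompt are illustrative; num_ctx here matches the 32K benchmark window rather than the full 128K):

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_generate_payload(model: str, prompt: str, num_ctx: int = 32768) -> dict:
    """Assemble a non-streaming request body for Ollama's /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # larger windows need much more VRAM
    }

payload = build_generate_payload(
    "mistral-nemo:12b-instruct-q4_K_M",
    "Fasse diesen Absatz auf Deutsch zusammen: ...",
)
body = json.dumps(payload).encode()
# To actually send it (requires a running daemon):
# urllib.request.urlopen(urllib.request.Request(
#     OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}))
print(payload["model"])
```

Pushing num_ctx toward the advertised 128K is where the model's long-context recall pays off, but budget VRAM for the KV cache before doing so.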
Settings: Q4_K_M GGUF, 32768 ctx, llama.cpp/CUDA, RTX 4090
Why this rating
7.8/10 — the 12B Apache-licensed alternative to Llama 3.1 8B and Qwen 2.5 14B. Solid all-rounder with excellent multilingual for a 12B, but doesn't beat either neighbor decisively. Loses points by sitting in an awkward middle.
Overview
Joint Mistral/NVIDIA release with native 128K context and a new Tekken tokenizer. Strong multilingual; popular fine-tune base.
Strengths
- 128K context
- Apache 2.0
- Multilingual
Weaknesses
- Tekken tokenizer slow to spread
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 7.5 GB | 9 GB |
| Q5_K_M | 8.7 GB | 11 GB |
| Q8_0 | 13.0 GB | 15 GB |
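The table doubles as a quick fit check: compare your card's VRAM against the "VRAM required" column. A small sketch (the 1 GB safety margin for OS/desktop overhead is an assumption, not a measured figure):

```python
# "VRAM required" figures copied from the quantization table above (GB).
QUANTS = {"Q4_K_M": 9.0, "Q5_K_M": 11.0, "Q8_0": 15.0}

def quants_that_fit(gpu_vram_gb: float, margin_gb: float = 1.0) -> list[str]:
    """Return quantizations whose VRAM need fits after a safety margin."""
    return [q for q, need in QUANTS.items() if need + margin_gb <= gpu_vram_gb]

print(quants_that_fit(12.0))  # e.g. a 12 GB RTX 3060
print(quants_that_fit(24.0))  # e.g. a 24 GB RTX 4090
```

On a 12 GB card this leaves Q4_K_M and Q5_K_M in play; Q8_0 only fits from 16 GB upward, and comfortably at 24 GB.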
Get the model
Ollama
One-line install
ollama run mistral-nemo:12b
HuggingFace
Original weights
Source repository with the original unquantized weights; you'll need to quantize them yourself (e.g. to GGUF) before running locally.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Mistral Nemo 12B Instruct.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Mistral Nemo 12B Instruct?
About 9 GB for the Q4_K_M quantization (7.5 GB file plus KV cache and runtime overhead). A 12 GB card runs Q4_K_M or Q5_K_M comfortably; Q8_0 needs 15 GB.
Can I use Mistral Nemo 12B Instruct commercially?
Yes. The model is released under Apache 2.0, which permits commercial use, modification, and redistribution.
What's the context length of Mistral Nemo 12B Instruct?
128K tokens natively, with better recall at that window than Llama 3.1 8B. Note that VRAM use grows with context; the throughput figures above were measured at 32,768 tokens.
How do I install Mistral Nemo 12B Instruct with Ollama?
Run `ollama pull mistral-nemo:12b-instruct-q4_K_M` to download it, then `ollama run mistral-nemo:12b-instruct-q4_K_M` for an interactive session.
Source: huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.