Phi-3.5 Mini Instruct
Compact 3.8B Phi for edge deployment. 128K context. Strong reasoning per parameter.
The right pick when VRAM is the gating constraint: sub-6 GB cards, integrated GPUs, edge devices, or a fast secondary model for routing/classification in agent loops. Microsoft's heavily curated, synthetic-textbook training data pays off; it's startlingly capable for 3.8B parameters.
Strengths
- 2.3 GB at Q4_K_M: runs on essentially anything, including 4 GB GPUs with comfortable context.
- Structured output and math are genuinely good for the size class — better than Llama 3.2 3B on GSM8K and JSON-mode tasks.
- MIT license: cleanest license in the curated-data model space.
Weaknesses
- Open-domain knowledge is shallow: the textbook-heavy training shows on pop culture, recent events, and obscure technical lore.
- Refusal behavior is aggressive — defaults to over-cautious answers on anything dual-use.
- Long-context recall is weak despite the 128K spec; past roughly 16K tokens, quality degrades sharply (see the sketch after this list for capping the allocated window).
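Given that falloff, it's worth capping the context window you actually allocate rather than reserving anything close to the full 128K. A minimal sketch against Ollama's local REST API (default port 11434), assuming the Q4_K_M tag from the install section below has already been pulled; the document path is a placeholder.

```python
import requests

# Placeholder input file; replace with your own long document.
long_text = open("doc.txt", encoding="utf-8").read()

resp = requests.post(
    "http://localhost:11434/api/generate",  # default local Ollama endpoint
    json={
        "model": "phi3.5:3.8b-mini-instruct-q4_K_M",
        "prompt": "Summarize the following document in five bullet points:\n\n" + long_text,
        "stream": False,
        # Allocate a 16K window instead of the full 128K: recall past ~16K is
        # marginal anyway, and a smaller num_ctx keeps the KV cache small on a 4 GB card.
        "options": {"num_ctx": 16384},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```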
Performance
- Q4_K_M (2.3 GB): 130–155 tok/s decode, TTFT under 50 ms
- Q5_K_M (2.8 GB): 120–140 tok/s
- Q8_0 (4.1 GB): 100–120 tok/s — surprisingly worth it; Q8 quality bump is larger than usual
Yes, for edge deployment, fast routing/classification in agent stacks, math-heavy structured tasks, or any rig with under 6 GB VRAM. No, for open-ended chat, creative writing, or current-events tasks.
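As a concrete picture of the routing/classification role, here is a minimal sketch that uses Phi-3.5 Mini as a fast intent router in front of a larger model, via Ollama's local chat endpoint with JSON-constrained output. It assumes Ollama is running on the default port and the Q4_K_M tag from the install section below has been pulled; the route labels and example query are purely illustrative.

```python
import json
import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # default local Ollama endpoint
MODEL = "phi3.5:3.8b-mini-instruct-q4_K_M"

def route(query: str) -> dict:
    """Classify a user query into a coarse route for a larger agent loop."""
    system = (
        "You are a router. Classify the user query into exactly one of: "
        '"code", "math", "search", "chat". '
        'Reply with JSON only, e.g. {"route": "math", "confidence": 0.9}.'
    )
    resp = requests.post(
        OLLAMA_CHAT,
        json={
            "model": MODEL,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": query},
            ],
            "format": "json",               # constrain the reply to valid JSON
            "stream": False,
            "options": {"temperature": 0},  # deterministic routing
        },
        timeout=60,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["message"]["content"])

print(route("Integrate x^2 * e^x dx and show the steps"))
# Expected shape: {"route": "math", "confidence": ...}
```

Per the throughput figures above, a short JSON reply like this comes back in a fraction of a second on a 4090-class card, which is what makes the pre-filter/router pattern viable.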
How it compares
- vs Llama 3.2 3B → Phi wins on math + structured output; Llama wins on conversational naturalness and knowledge breadth. Pick Phi for tooling, Llama for chat.
- vs Llama 3.1 8B → Llama 3.1 8B is materially more capable across the board but uses 2× VRAM. Phi is the right pick only when VRAM matters.
- vs Gemma 3 4B → very close call; Gemma 3 4B has a slight edge on multilingual + general chat, Phi 3.5 Mini wins on math + JSON. Both excellent in the 4B class.
- vs Phi-4 14B → not in the same class; Phi-4 is competitive with Llama 3.1 8B, Phi-3.5 Mini is a different efficiency tier.
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
ollama run phi3.5:3.8b-mini-instruct-q4_K_M
Benchmark settings: Q4_K_M GGUF, 4096 ctx, llama.cpp/CUDA, RTX 4090
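To sanity-check the throughput figures listed earlier on your own hardware, you can read the timing fields Ollama reports on a non-streaming generate call. A minimal sketch, assuming a local Ollama server on the default port; the prompt is arbitrary.

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # default local Ollama endpoint
    json={
        "model": "phi3.5:3.8b-mini-instruct-q4_K_M",
        "prompt": "Explain the birthday paradox in three sentences.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
data = resp.json()

# Ollama reports token counts plus durations in nanoseconds.
decode_tps = data["eval_count"] / data["eval_duration"] * 1e9
print(f"decode: {decode_tps:.0f} tok/s")

# Prompt-eval fields can be omitted when the prompt is served from cache.
if data.get("prompt_eval_duration"):
    prompt_tps = data["prompt_eval_count"] / data["prompt_eval_duration"] * 1e9
    print(f"prompt eval: {prompt_tps:.0f} tok/s")
```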
Why this rating
7.2/10: punches well above its parameter count, especially on math and structured output. Loses points to general-purpose models with 2× the parameters on open-ended chat, but no other 4B-class model is in this league.
Overview
Compact 3.8B Phi for edge deployment. 128K context. Strong reasoning per parameter.
Strengths
- MIT license
- 128K context
- Edge-class footprint
Weaknesses
- Heavy refusals
- Synthetic-data quirks
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 2.4 GB | 4 GB |
| Q8_0 | 4.1 GB | 5 GB |
Get the model
Ollama
One-line install
ollama run phi3.5:3.8b
Read our Ollama review →
HuggingFace
Original weights
Source repository; you'll need to convert and quantize the weights yourself (e.g., to GGUF).
Hardware that runs this
Cards with enough VRAM for at least one quantization of Phi-3.5 Mini Instruct.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Phi-3.5 Mini Instruct?
About 4 GB. The Q4_K_M quant is a 2.4 GB file and fits on 4 GB cards with room for context; Q8_0 needs roughly 5 GB.
Can I use Phi-3.5 Mini Instruct commercially?
Yes. It's released under the MIT license, which permits commercial use.
What's the context length of Phi-3.5 Mini Instruct?
128K tokens on paper, though in practice recall degrades noticeably past roughly 16K.
How do I install Phi-3.5 Mini Instruct with Ollama?
Run ollama pull phi3.5:3.8b-mini-instruct-q4_K_M and then ollama run phi3.5:3.8b-mini-instruct-q4_K_M, or simply ollama run phi3.5:3.8b for the default tag.
Source: huggingface.co/microsoft/Phi-3.5-mini-instruct
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.