Qwen 2.5 Coder 32B Instruct
Coding-specialist Qwen 2.5. Beats GPT-4o on HumanEval and matches Sonnet on many code-edit benchmarks. The default local coding model on 24 GB cards.
The model to run if you want a Cursor / Copilot replacement on your own hardware. Qwen 2.5 Coder 32B is the headline open-weight coding model — strong fill-in-the-middle, strong repo-scale reasoning, fast enough on a 4090 to keep up with interactive editing.
Strengths
- Fill-in-the-middle is genuinely good — the actual mechanism Cursor and Copilot rely on, not just chat-style code completion.
- Repo-aware reasoning — handles 32K-context code review tasks credibly; instruction-tuned to navigate multi-file context.
- 70–88 tok/s on a 4090 at Q4 — fast enough for interactive, as-you-type completion once integrated with a properly streaming editor plugin.
Weaknesses
- License terms vary across the Qwen family — the 32B Coder Instruct weights ship under Apache 2.0, but check the license on other sizes before a SaaS deployment.
- Lags closed models on novel-architecture tasks — anything genuinely outside its training distribution still falls back to plausible-but-wrong patterns.
- Repo-context isn't free — feeding a real codebase still requires good RAG or AST-aware chunking; the model alone won't fix bad context selection.
Throughput (RTX 4090)
- Q4_K_M (19 GB): 70–88 tok/s decode, TTFT ~140 ms
- Q5_K_M (22.6 GB): 58–72 tok/s
- Q8_0 (35 GB): partial offload, 18–25 tok/s
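The fill-in-the-middle strength above relies on Qwen 2.5 Coder's dedicated FIM special tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`, per the model card). A minimal sketch of assembling such a prompt — the example function body is illustrative, and the prompt must be sent in raw mode, bypassing the chat template:

```python
# Sketch: build a fill-in-the-middle prompt for Qwen 2.5 Coder.
# The model generates the code that belongs between prefix and suffix.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a raw FIM prompt using Qwen's FIM special tokens."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def mean(xs):\n    ",                 # code before the cursor
    suffix="\n    return total / len(xs)\n",      # code after the cursor
)
```

This is exactly the prompt shape editor plugins construct from the text around your cursor; the completion comes back as the "middle" span only.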
Should you run it?
Yes, for any developer with an RTX 3090 / 4090 / 5080+ who wants Copilot-class autocomplete without the cloud round-trip; this is the headline win for local AI. No, for developers happy paying $10–20/month for a closed service: on novel languages or rare frameworks, GPT-4 / Claude still produce more reliable code.
How it compares
- vs DeepSeek Coder V2 Lite → Qwen 2.5 Coder 32B is meaningfully stronger; DeepSeek Coder V2 Lite (16B) is the right pick under 16 GB VRAM.
- vs Codestral 22B → Qwen 2.5 Coder 32B wins on capability; Codestral has cleaner Mistral license terms.
- vs Qwen 2.5 32B Instruct → Coder is dramatically better at coding; pick Instruct for general chat.
- vs DeepSeek V3 / R1 → V3 and R1 are stronger at hard reasoning but far too large for single-card use.
Quickstart
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama run qwen2.5-coder:32b-instruct-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on 4090
Editor integration: Continue.dev or Tabby with Ollama backend
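Under the hood, Continue.dev and Tabby talk to Ollama over its REST API. A sketch of the request body the settings above translate to — the prompt text is illustrative; the endpoint and fields follow Ollama's documented `/api/generate` API:

```python
import json

# Sketch: the review's settings (16384-token context, streaming)
# expressed as an Ollama /api/generate request body.
payload = {
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "prompt": "Write a Python function that reverses a linked list.",
    "stream": True,                 # stream tokens for interactive editing
    "options": {"num_ctx": 16384},  # match the 16384-ctx setting above
}
body = json.dumps(payload)
# POST body to http://localhost:11434/api/generate
```

Note that `num_ctx` must be set per request (or in a Modelfile) — Ollama's default context window is smaller than the 16K this review assumes.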
Why this rating
9.2/10 — the strongest open-weight coding model that runs on a single 24 GB GPU. Genuinely competitive with closed coding models (GPT-4, Claude) on most non-frontier tasks. It only loses points for trailing closed models on tasks genuinely outside its training distribution.
Strengths
- Best open-weight coder at release
- Apache 2.0
- Strong fill-in-middle
Weaknesses
- Less strong on general chat than non-coder Qwen
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 19.0 GB | 24 GB |
| Q8_0 | 34.0 GB | 40 GB |
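As a rough rule of thumb, a quant fits fully on-GPU when its file size plus runtime overhead stays under the card's VRAM. A sketch — the ~3 GB overhead figure is an assumption covering KV cache and CUDA buffers, and it grows with context length:

```python
# Rough heuristic: will a GGUF quant fit entirely on one GPU?
# overhead_gb is an assumed allowance for KV cache and runtime buffers.
def fits_in_vram(file_size_gb: float, vram_gb: float,
                 overhead_gb: float = 3.0) -> bool:
    return file_size_gb + overhead_gb <= vram_gb

# Table rows above: Q4_K_M (19 GB) fits a 24 GB card; Q8_0 (34 GB) does not.
```

Anything that fails this check falls back to partial CPU offload, which is where the Q8_0 figure of 18–25 tok/s comes from.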
Get the model
Ollama
One-line install
ollama run qwen2.5-coder:32b
Read our Ollama review →
HuggingFace
Original weights
Source repository with the original FP16 weights — you'll need to quantize them yourself (or download a pre-quantized GGUF build).
Hardware that runs this
Cards with enough VRAM for at least one quantization of Qwen 2.5 Coder 32B Instruct.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Qwen 2.5 Coder 32B Instruct?
24 GB, for the Q4_K_M quant (19.0 GB file); Q8_0 needs 40 GB or partial CPU offload.
Can I use Qwen 2.5 Coder 32B Instruct commercially?
Yes: the 32B Coder Instruct weights are released under Apache 2.0.
What's the context length of Qwen 2.5 Coder 32B Instruct?
32K tokens natively, extendable to 128K with YaRN; the settings in this review use a 16384-token window.
How do I install Qwen 2.5 Coder 32B Instruct with Ollama?
Run ollama pull qwen2.5-coder:32b-instruct-q4_K_M, then ollama run the same tag.
Source: huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.