Qwen 3 32B
Dense Qwen 3 32B. Best dense open-weight model in its size class at release; pairs nicely with a single RTX 5090 or 4090.
The new daily driver for RTX 3090 / 4090 / 5090 owners. Same VRAM footprint as Qwen 2.5 32B, materially better on reasoning thanks to thinking mode, similar speed in non-thinking. The right answer to "what runs on my 24 GB GPU?" today.
Strengths
- 19 GB at Q4_K_M — full GPU offload on 24 GB with 16K context.
- Hybrid reasoning lifts hard-task quality past Qwen 2.5 32B without VRAM cost.
- Multilingual carryover still strong.
Weaknesses
- Thinking-mode tokens cost real time — verbose intermediate reasoning eats throughput.
- License caps as before.
- Qwen 2.5 Coder 32B still beats it for coding — coder is a dedicated specialist.
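Thinking mode is toggled per turn. A minimal sketch, assuming the /think and /no_think soft switches documented for Qwen 3 (the helper name is ours):

```python
def with_thinking(prompt: str, think: bool = True) -> str:
    """Append Qwen 3's soft switch to a user turn to force or suppress thinking."""
    return f"{prompt} {'/think' if think else '/no_think'}"

# Suppress the verbose intermediate reasoning when latency matters:
print(with_thinking("Summarize this changelog.", think=False))
# → Summarize this changelog. /no_think
```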
Performance
- Q4_K_M (19.4 GB): 68–86 tok/s decode (non-thinking); same speed thinking, more tokens emitted
- Q5_K_M (22.9 GB): 56–70 tok/s
- Q8_0 (35 GB): partial offload, 18–24 tok/s
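Decode speed is unchanged in thinking mode; the wall-clock cost comes from the extra tokens. A quick sketch of the arithmetic (the token counts are illustrative, not measured):

```python
def wall_clock_s(visible_tokens: int, thinking_tokens: int, tok_per_s: float) -> float:
    # Thinking tokens decode at the same tok/s but still take real time.
    return (visible_tokens + thinking_tokens) / tok_per_s

# At 70 tok/s (mid Q4_K_M band): 500 visible tokens alone vs. with 2000 thinking tokens.
print(round(wall_clock_s(500, 0, 70.0), 1))     # 7.1
print(round(wall_clock_s(500, 2000, 70.0), 1))  # 35.7
```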
Verdict
Yes, for 24 GB single-card owners who want the strongest dense model with hybrid reasoning. The new default daily driver. No, for dedicated coding workflows (pick Qwen 2.5 Coder 32B), or hard reasoning where QwQ 32B's specialization wins.
How it compares
- vs Qwen 2.5 32B Instruct → Qwen 3 32B wins outright at the same VRAM. New work should default to Qwen 3.
- vs QwQ 32B → QwQ is the reasoning specialist; Qwen 3 32B is the generalist with optional reasoning. Pick QwQ for math/code reasoning, Qwen 3 32B for general chat.
- vs Llama 3.3 70B → Llama 3.3 70B is smarter but 3× slower on the same hardware. Qwen 3 32B is the productivity pick.
- vs Qwen 3 30B-A3B (MoE) → 30B-A3B is faster (~2× tok/s) due to MoE; Qwen 3 32B dense is steadier on instruction following.
ollama pull qwen3:32b
ollama run qwen3:32b
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090
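Ollama's default context window is shorter than the 16,384 listed above; one way to pin it is a custom Modelfile, sketched here by generating the file (the qwen3-32b-16k model name is our placeholder):

```python
from pathlib import Path

# Build a Modelfile that pins the review's recommended 16K context.
# Apply it with: ollama create qwen3-32b-16k -f Modelfile
modelfile = "FROM qwen3:32b\nPARAMETER num_ctx 16384\n"
Path("Modelfile").write_text(modelfile)
print(modelfile, end="")
```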
Why this rating
8.9/10 — the 32B-class evolution of the Qwen 3 thinking-mode story. Stronger absolute capability than Qwen 2.5 32B, runs in the same VRAM. Replaces 2.5 32B as the default for 24 GB single-card daily-driver use.
Strengths
- Strongest dense ~30B model
- Apache 2.0
- Tool calling
Weaknesses
- Needs 24 GB+ VRAM
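The tool calling listed under strengths can be exercised through Ollama's /api/chat endpoint. A sketch of the request payload, assuming a local Ollama server; the get_weather tool is a made-up example:

```python
import json

# Advertise one tool to qwen3:32b via Ollama's /api/chat request format.
payload = {
    "model": "qwen3:32b",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "stream": False,
}
# POST this JSON to http://localhost:11434/api/chat with a running Ollama server;
# the model replies with a tool_calls entry instead of plain text when it picks the tool.
print(json.dumps(payload, indent=2)[:60])
```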
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 19.0 GB | 24 GB |
| Q5_K_M | 22.0 GB | 28 GB |
| Q8_0 | 34.0 GB | 40 GB |
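A rough way to read the table: the weights file plus a few GB of KV cache and runtime overhead must fit in VRAM. The 3 GB overhead figure below is our assumption for a ~16K context, not a measured value:

```python
def fits_in_vram(file_gb: float, vram_gb: float, overhead_gb: float = 3.0) -> bool:
    """Rough fit check: quantized weights + KV-cache/runtime overhead vs. card VRAM.

    overhead_gb is an assumed ~16K-context figure; longer contexts need more.
    """
    return file_gb + overhead_gb <= vram_gb

# Matches the table: Q4_K_M (19.0 GB) fits a 24 GB card; Q8_0 (34.0 GB) does not.
print(fits_in_vram(19.0, 24.0))  # True
print(fits_in_vram(34.0, 24.0))  # False
```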
Get the model
Ollama
One-line install
ollama run qwen3:32b
HuggingFace
Original weights
Source repository for the original weights; quantize them yourself for local GGUF use.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Qwen 3 32B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Qwen 3 32B?
24 GB, for full GPU offload at Q4_K_M (19 GB file). Q8_0 needs roughly 40 GB, or partial offload at reduced speed.
Can I use Qwen 3 32B commercially?
Yes. The weights are released under Apache 2.0.
What's the context length of Qwen 3 32B?
The recommended settings above use a 16,384-token context on a 24 GB card; longer contexts raise the VRAM requirement.
How do I install Qwen 3 32B with Ollama?
Run ollama pull qwen3:32b, then ollama run qwen3:32b.
Source: huggingface.co/Qwen/Qwen3-32B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.