QwQ 32B Preview
Qwen team's reasoning-focused experimental release. Visible chain-of-thought in <think> tags. Precursor to Qwen 3's thinking mode.
QwQ 32B is the reasoning specialist of the Qwen 2.5 generation. Built for math, code planning, and multi-step problem solving. It always reasons (no toggle), so every output is preceded by visible chain-of-thought — useful when you want to see the model's work, expensive when you don't.
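Because every completion arrives wrapped in visible reasoning, downstream code usually needs to separate the chain-of-thought from the final answer. A minimal sketch, assuming the standard `<think>…</think>` delimiters the model emits (function name is illustrative):

```python
import re

# Assumes reasoning is delimited by <think>...</think>, as QwQ emits it.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def split_answer(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a QwQ-style completion."""
    m = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", raw).strip()
    return reasoning, answer

raw = "<think>2+2 is 4 because of basic arithmetic.</think>The answer is 4."
reasoning, answer = split_answer(raw)  # answer == "The answer is 4."
```

Stripping the think block before logging or display is the usual pattern when you want QwQ's quality without showing users its scratch work.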
Strengths
- State-of-the-art reasoning at 32B — surpasses Qwen 2.5 32B Instruct on GSM8K, MATH, and HumanEval.
- Visible chain-of-thought is genuinely useful for verifying the model's logic on hard problems.
- Same VRAM footprint as Qwen 2.5 32B / Qwen 3 32B.
Weaknesses
- Always reasons — every prompt eats 2–3× the tokens of a non-reasoning model.
- Wastes capacity on simple prompts — using it for "write me a quick email" is a clear mismatch.
- Generalist quality is below Qwen 2.5 32B Instruct outside the reasoning sweet spot.
Quantization speed (RTX 4090)
- Q4_K_M (19.4 GB): 68–86 tok/s decode — but 2–3× more tokens emitted per answer
- Q5_K_M (22.9 GB): 56–70 tok/s
- Q8_0 (35 GB): partial offload only
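The raw decode speed understates latency, because reasoning tokens are emitted before the answer. A quick arithmetic sketch of effective time-to-answer, using the 2–3× overhead figure from above (function and numbers are illustrative):

```python
def time_to_answer(answer_tokens: int, decode_tps: float, overhead: float = 2.5) -> float:
    """Seconds until the answer finishes, counting reasoning tokens
    emitted first. overhead = total tokens per answer token (2-3x here)."""
    return answer_tokens * overhead / decode_tps

# 300-token answer at Q4_K_M's ~75 tok/s midpoint:
t = time_to_answer(300, 75.0)  # 10.0 s, vs 4.0 s with no reasoning overhead
```

In other words, a 75 tok/s decode rate behaves more like ~30 answer-tokens/s once the chain-of-thought is counted.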
Should you run it?
Yes, for math-heavy work, code planning, scientific problem solving, and agent loops where reasoning quality matters more than throughput. No, for general chat, drafting, or any high-throughput application — pick Qwen 3 32B with thinking mode toggled per turn instead.
How it compares
- vs Qwen 3 32B → Qwen 3 32B with thinking is more flexible; QwQ 32B has slightly stronger reasoning but always pays the latency cost. Qwen 3 32B has largely subsumed QwQ for new deployments.
- vs DeepSeek R1 Distill Qwen 32B → R1 Distill is meaningfully stronger on the hardest reasoning tasks; QwQ holds up on most everyday math/code.
- vs Qwen 2.5 32B Instruct → QwQ is the reasoning specialist, Instruct is the generalist. Different jobs.
ollama pull qwq:32b-preview-q4_K_M
ollama run qwq:32b-preview-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4090
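To reproduce the 16384-token context from the settings above, you can set it at runtime; `/set parameter` is part of Ollama's interactive session (shown here as a transcript sketch):

```shell
# pull and launch the quant tested above
ollama run qwq:32b-preview-q4_K_M
# then, inside the interactive session, match the tested context size:
# >>> /set parameter num_ctx 16384
```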
Why this rating
8.7/10 — Alibaba's dedicated reasoning specialist in the 32B body. State-of-the-art for math/code reasoning at this size, but pays for it with verbose chain-of-thought that doubles latency on every prompt. Loses points to general-purpose models when reasoning isn't required.
Strengths
- Strong math and reasoning
- Apache 2.0
- Visible CoT
Weaknesses
- Verbose output
- Not great for chat
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 19.0 GB | 24 GB |
Get the model
Ollama
One-line install
ollama run qwq:32b
Read our Ollama review →
HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of QwQ 32B Preview.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run QwQ 32B Preview?
A 24 GB card covers the Q4_K_M quantization (~19 GB file); larger quants such as Q8_0 require partial CPU offload.
Can I use QwQ 32B Preview commercially?
Yes. It is released under the Apache 2.0 license.
What's the context length of QwQ 32B Preview?
32,768 tokens per the model card; the settings above use a 16,384-token context.
How do I install QwQ 32B Preview with Ollama?
Run `ollama pull qwq:32b-preview-q4_K_M`, then `ollama run qwq:32b-preview-q4_K_M`.
Source: huggingface.co/Qwen/QwQ-32B-Preview
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.