Build a local reasoning-model stack (May 2026)
Run a reasoning-class model locally for math, code synthesis, multi-step analysis, and long-horizon problem-solving. This stack is honest about the reasoning-token cost (an extra 200-2000 tokens per query) and the hardware requirements that follow.
- 01 | Model | Primary reasoning model | deepseek-r1-distill-qwen-32b
DeepSeek R1 Distill Qwen 32B is the reasoning model that actually runs in 24GB of VRAM via AWQ-INT4. Stronger reasoning quality per parameter than the full DeepSeek R1 (which needs ~700GB of weights and is impractical to run locally). The distill gives ~80% of R1's reasoning at 5% of the VRAM.
- 02 | Model | Alternative reasoning model (Qwen team) | qwq-32b
QwQ 32B is the Qwen team's open reasoning model. Slightly different reasoning style than DeepSeek R1 — pick QwQ when you want shorter reasoning blocks and faster wall-clock answers; pick DeepSeek R1 Distill when you want longer, more thorough reasoning.
- 03 | Model | General model with reasoning toggle | qwen-3-32b
Qwen 3 32B has a reasoning-mode toggle (the <think> block convention) that you can enable per-query. Useful when most of your workload doesn't need reasoning — fall back to standard mode for chat, enable thinking for math / code / analysis.
- 04 | Tool | Inference engine | vllm
vLLM over Ollama for reasoning models: continuous batching matters because reasoning-token emissions are long (a single query can emit 5000+ tokens). Prefix caching helps when batched reasoning-mode queries share system prompts. KV-cache management matters more here than on chat models.
- 05 | Tool | Frontend with reasoning-block rendering | openwebui
Open WebUI renders <think> blocks as collapsible reasoning sections — the right UX for reasoning models. The user sees the conclusion first and can expand to inspect the reasoning. Cleaner than a wall of thinking tokens.
- 06 | Hardware | GPU (minimum tier for 32B AWQ + 32K context) | rtx-4090
RTX 4090 24GB is the floor. 32B AWQ + 32K context fits with ~2GB headroom — enough for reasoning-block emission but tight. The 5090 32GB is the comfortable tier; M3 Max 64GB / M4 Max are credible alternatives via MLX-LM.
Why reasoning models change the calculus
Reasoning models — DeepSeek R1 family, QwQ, Qwen 3 in thinking mode — emit <think> blocks before their actual answers. These blocks contain the model's internal chain of thought, often 200-2000 tokens of intermediate reasoning that never appears in the final answer. The architectural reality this stack respects: reasoning models cost 2-5x more tokens per query than chat models, and the latency budget shifts accordingly.
For some workloads, this is dramatically worth it. Math problems, multi-step code synthesis, complex analysis tasks — reasoning models often beat chat models of the same parameter count by 20-40% on these specific benchmarks. For other workloads (chat, simple code edits, summarization), the reasoning-token tax is pure overhead and a chat model is the right pick.
The headline architectural choice this stack makes: 32B-class distilled reasoning models, not the frontier full models. The full DeepSeek R1 needs ~700GB of weights — impossible locally without a multi-machine cluster. The Distill Qwen 32B variant captures ~80% of the reasoning quality at 5% of the VRAM, fits an RTX 4090 in AWQ-INT4, and runs at 30-40 tok/s. That's the realistic local-reasoning-model tier.
Step-by-step setup
1. Bring up vLLM with DeepSeek R1 Distill Qwen 32B
# AWQ-INT4 fits 24GB with 32K context — but with reasoning-block
# emission, KV cache fills quickly. Conservative settings:
docker run --gpus all --rm -d --name vllm \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
vllm/vllm-openai:v0.17.1 \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill \
--enforce-eager
--gpu-memory-utilization 0.85 rather than 0.9 because reasoning-block emission produces longer outputs than chat models — the KV cache needs more headroom. The --enforce-eager flag avoids CUDA graph compilation issues that some R1 distill versions trigger.
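Once the container is up, a direct request against the OpenAI-compatible endpoint confirms the model loads and shows what raw reasoning output looks like. A minimal sketch, assuming the openai Python package is installed and the served model name matches the --model flag above; the prompt is only an illustration.
# Sanity check against the vLLM OpenAI-compatible API on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any-string")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
    messages=[{"role": "user", "content": "What is 17 * 23? Show your reasoning."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
)

# The raw content includes the reasoning block (the <think> convention)
# followed by the actual answer.
print(resp.choices[0].message.content)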
2. Optional — also load Qwen 3 32B for reasoning-toggle workflows
# Qwen 3 32B has a reasoning-mode toggle. Run as a second vLLM
# instance on a different port if you have headroom (or swap it in
# when needed):
docker run --gpus all --rm -d --name vllm-qwen3 \
-p 8001:8000 \
vllm/vllm-openai:v0.17.1 \
--model Qwen/Qwen3-32B-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill
# NOTE: only run this if you have a 5090 (32GB) or larger.
# Two 32B AWQ models do not fit on a 4090.
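With the second instance running, Qwen 3's reasoning mode can be toggled per request. A minimal sketch, assuming the openai Python package and that your vLLM build forwards chat_template_kwargs (and Qwen 3's documented enable_thinking switch) to the chat template — verify against your vLLM and model versions.
# Per-request reasoning toggle for Qwen 3 on the second vLLM instance (port 8001).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="any-string")

def ask(prompt: str, thinking: bool) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-32B-AWQ",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        top_p=0.95,
        # Forwarded to the chat template; controls whether a <think> block is emitted.
        # Assumes this vLLM build supports chat_template_kwargs passthrough.
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
    return resp.choices[0].message.content

# Chat-style query: skip the reasoning block.
print(ask("Summarize what continuous batching does in one sentence.", thinking=False))
# Math query: spend the thinking tokens.
print(ask("Prove that the sum of two odd integers is even.", thinking=True))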
3. Wire Open WebUI as the reasoning-aware frontend
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
ghcr.io/open-webui/open-webui:latest
Open WebUI renders <think> blocks as collapsible sections — the user sees the conclusion first and can expand to inspect the reasoning. This UX pattern is what makes reasoning models actually usable in chat. Without it, the user sees a wall of thinking tokens before the answer.
4. Configure for reasoning-aware sampling
# Reasoning models have specific sampling recommendations:
# - temperature 0.6-0.8 (higher than chat for diverse reasoning paths)
# - top_p 0.95
# - presence_penalty 0 (don't penalize reasoning-token repetition)
# - frequency_penalty 0
# In Open WebUI, set these as workspace defaults for the reasoning
# models. The UI exposes all four sliders.
# For code synthesis tasks specifically, lower temperature to 0.2
# during the answer block while keeping it 0.8 during the reasoning
# block — this is a manual workflow vLLM doesn't support natively;
# Open WebUI's per-message-temperature feature is the workaround.
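If you drive the endpoint from scripts rather than Open WebUI, the same recommendations map directly onto request parameters. A minimal sketch, assuming the step-1 vLLM instance on port 8000; the values mirror the notes above and the prompt is illustrative.
# Reasoning-friendly sampling defaults, applied per request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any-string")

REASONING_SAMPLING = {
    "temperature": 0.7,        # within the 0.6-0.8 band; higher than chat defaults
    "top_p": 0.95,
    "presence_penalty": 0.0,   # don't penalize reasoning-token repetition
    "frequency_penalty": 0.0,
}

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC in five steps."}],
    **REASONING_SAMPLING,
)
print(resp.choices[0].message.content)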
The reasoning-token tax
The cost reasoning models impose on every query, with honest numbers:
- Trivial questions (“what's the capital of France”): reasoning models still emit 100-300 thinking tokens. Pure overhead; use a chat model.
- Standard chat / explanation: 300-800 thinking tokens. Marginal benefit vs chat models; sometimes worth it for the polish.
- Math problems / step-by-step analysis: 500-1500 thinking tokens. Strong benefit vs chat models on correctness.
- Complex code synthesis / architecture decisions: 1000-3000 thinking tokens. Substantial benefit; this is where reasoning models earn their tax.
- Multi-step planning / proof construction: 2000-5000+ thinking tokens. The tier where reasoning models genuinely outperform anything else available locally.
The cost in wall-clock time on RTX 4090: each thinking token costs ~25-30ms (the same throughput as answer tokens). A query with 2000 thinking tokens adds ~50-60 seconds before the actual answer starts streaming. Plan UX accordingly — show progress; never block-wait silently.
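A back-of-the-envelope helper makes the budget concrete. A minimal sketch, assuming the ~30-40 tok/s decode rate cited earlier for 32B AWQ on a 4090 (35 tok/s midpoint).
# Estimate seconds of reasoning emission before the visible answer starts streaming.
def seconds_before_answer(thinking_tokens: int, tok_per_s: float = 35.0) -> float:
    return thinking_tokens / tok_per_s

for n in (300, 800, 2000, 5000):
    print(f"{n:>5} thinking tokens -> ~{seconds_before_answer(n):.0f}s before the answer streams")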
Failure modes you'll hit
- Reasoning blocks leak into structured output. Some clients parse model output as JSON; the <think> block breaks the parse. Strip thinking tokens before structured-output parsing (see the sketch after this list), or instruct the model to skip reasoning when emitting JSON.
- Context-window exhaustion on long reasoning. Complex tasks can emit 5000+ thinking tokens. With 32K context and a 4K input prompt, that leaves ~28K tokens for reasoning + answer. Most queries fit; pathological cases don't. Use a reasoning model with longer context if you hit this regularly.
- OOM on KV cache during reasoning. KV cache scales with output length. Long thinking blocks blow past VRAM budgets sized for chat-class outputs. Set a conservative gpu_memory_utilization (0.85, not 0.9).
- QwQ vs DeepSeek R1 reasoning style mismatch. QwQ's reasoning is shorter and more direct; DeepSeek R1's is more thorough. Switching between them mid-conversation produces inconsistent UX. Pick one; stick with it per workflow.
- Sampler config drift. Reasoning models are more sensitive to sampler parameters than chat models. A temperature of 1.0 (chat default) often produces incoherent reasoning. Use 0.6-0.8.
- Tool-calling format confusion. Reasoning models trained on chain-of-thought sometimes emit reasoning inside tool-call JSON, breaking the parse. Newer reasoning-tuned models handle this; older ones don't. Test with your specific tool-calling client.
- Premature stopping on EOS during reasoning. Some configs treat </think> as a stop token. Verify stop-token list excludes reasoning-block delimiters.
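For the first failure mode above, stripping the reasoning block before parsing is a one-regex fix. A minimal sketch using the <think> delimiter convention this stack assumes; adjust the tags if your model emits different ones.
# Strip <think>...</think> blocks before handing model output to a JSON parser.
import json
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(raw: str) -> str:
    """Return only the final answer, with any reasoning block removed."""
    return THINK_BLOCK.sub("", raw).strip()

raw_output = '<think>The user wants JSON with keys name and count.</think>\n{"name": "demo", "count": 3}'
data = json.loads(strip_reasoning(raw_output))  # parses cleanly once reasoning is gone
print(data["count"])  # -> 3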
Variations and alternatives
Apple Silicon variation. Replace vLLM + 4090 with MLX-LM on an M3 Max or M4 Max with 64GB of unified memory. The unified-memory architecture handles reasoning-block emissions well; long-context throughput stays stable. Pick this when you're Apple-native; expect a ~30% throughput drop.
SGLang variation. Replace vLLM with SGLang if you process many reasoning queries with shared system prompts (batch reasoning workflows). RadixAttention's prefix tree compounds reasoning-mode wins.
Higher-VRAM variation. RTX 5090 32GB or dual-RTX-4090 (TP=2) lets you run two reasoning models simultaneously — DeepSeek R1 Distill for thorough reasoning, QwQ for fast reasoning. Switch per query type.
Cloud-API hybrid. Use full DeepSeek R1 (not Distill) via DeepSeek's API for the hardest tasks; fall back to local Distill for the routine ones. Open WebUI's provider abstraction makes the dual-backend pattern natural.
Who should avoid this stack
- Anyone whose workload is mostly chat or simple tasks. Reasoning-token tax is pure overhead. Use the workstation stack instead.
- Anyone with strict latency budgets. Reasoning models add 50-300% to wall-clock time. If sub-second response is required, chat models or smaller reasoning models (7B-class) are the only viable options.
- Anyone on 16GB VRAM. 32B reasoning models don't fit. Drop to 14B-class reasoning (less capable but still useful) or use API.
- Anyone whose tool-calling client doesn't handle reasoning blocks. Older agent harnesses break on <think> blocks. Verify compatibility before committing.
Going deeper
- DeepSeek R1 Distill Qwen 32B catalog entry — model architecture, reasoning quality benchmarks.
- QwQ 32B catalog entry — the Qwen-team alternative.
- vLLM operational review — the runtime-specific operator detail (KV cache management, chunked prefill).
- Inference runtime ecosystem map — full landscape with the alternatives.
- RTX 4090 workstation stack — the chat-model alternative on the same hardware.