Build a local reasoning-model stack (May 2026)
Run a reasoning-class model locally for math, code synthesis, multi-step analysis, and long-horizon problem-solving. This stack is honest about the reasoning-token cost (an extra 200-2000 tokens per query) and the hardware requirements that follow.
- 01 | Model | Primary reasoning model | deepseek-r1-distill-qwen-32b
DeepSeek R1 Distill Qwen 32B is the reasoning model that actually runs in 24GB of VRAM via AWQ-INT4. Stronger reasoning quality per parameter than the full DeepSeek R1 (which needs ~700GB of weights and is impractical to run locally). The distill gives ~80% of R1's reasoning at 5% of the VRAM.
- 02 | Model | Alternative reasoning model (Qwen team) | qwq-32b
QwQ 32B is the Qwen team's open reasoning model. Slightly different reasoning style than DeepSeek R1 — pick QwQ when you want shorter reasoning blocks and faster wall-clock answers; pick DeepSeek R1 Distill when you want longer, more thorough reasoning.
- 03 | Model | General model with reasoning toggle | qwen-3-32b
Qwen 3 32B has a reasoning-mode toggle (the <think> block convention) that you can enable per-query. Useful when most of your workload doesn't need reasoning — fall back to standard mode for chat, enable thinking for math / code / analysis.
- 04 | Tool | Inference engine | vllm
vLLM over Ollama for reasoning models: continuous batching matters because reasoning-token emissions are long (a single query can emit 5000+ tokens). Prefix caching helps when batched reasoning-mode queries share system prompts. KV-cache management matters more here than on chat models.
- 05 | Tool | Frontend with reasoning-block rendering | openwebui
Open WebUI renders <think> blocks as collapsible reasoning sections — the right UX for reasoning models. The user sees the conclusion first and can expand to inspect the reasoning. Cleaner than a wall of thinking tokens.
- 06 | Hardware | GPU (minimum tier for 32B AWQ + 32K context) | rtx-4090
RTX 4090 24GB is the floor. 32B AWQ + 32K context fits with ~2GB headroom — enough for reasoning-block emission but tight. The 5090 32GB is the comfortable tier; M3 Max 64GB / M4 Max are credible alternatives via MLX-LM.
Why reasoning models change the calculus
Reasoning models — DeepSeek R1 family, QwQ, Qwen 3 in thinking mode — emit <think> blocks before their actual answers. These blocks contain the model's internal chain of thought, often 200-2000 tokens of intermediate reasoning that never appears in the final answer. The architectural reality this stack respects: reasoning models cost 2-5x more tokens per query than chat models, and the latency budget shifts accordingly.
For some workloads, this is dramatically worth it. Math problems, multi-step code synthesis, complex analysis tasks — reasoning models often beat chat models of the same parameter count by 20-40% on these specific benchmarks. For other workloads (chat, simple code edits, summarization), the reasoning-token tax is pure overhead and a chat model is the right pick.
The headline architectural choice this stack makes: 32B-class distilled reasoning models, not the frontier full models. The full DeepSeek R1 needs ~700GB of weights — impossible locally without a multi-machine cluster. The Distill Qwen 32B variant captures ~80% of the reasoning quality at 5% of the VRAM, fits an RTX 4090 in AWQ-INT4, and runs at 30-40 tok/s. That's the realistic local-reasoning-model tier.
Step-by-step setup
1. Bring up vLLM with DeepSeek R1 Distill Qwen 32B
# AWQ-INT4 fits 24GB with 32K context — but with reasoning-block
# emission, KV cache fills quickly. Conservative settings:
docker run --gpus all --rm -d --name vllm \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
vllm/vllm-openai:v0.17.1 \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill \
--enforce-eager
--gpu-memory-utilization 0.85 rather than 0.9 because reasoning-block emission produces longer outputs than chat models — the KV cache needs more headroom. The --enforce-eager flag avoids CUDA graph compilation issues that some R1 distill versions trigger.
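Once the container is up, a direct request against the OpenAI-compatible endpoint confirms the model loads and shows what raw reasoning output looks like. A minimal sketch, assuming the openai Python package is installed and the served model name matches the --model flag above; the prompt is only an illustration.
# Sanity check against the vLLM OpenAI-compatible API on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any-string")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
    messages=[{"role": "user", "content": "What is 17 * 23? Show your reasoning."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
)

# The raw content includes the reasoning block (the <think> convention)
# followed by the actual answer.
print(resp.choices[0].message.content)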
2. Optional — also load Qwen 3 32B for reasoning-toggle workflows
# Qwen 3 32B has a reasoning-mode toggle. Run as a second vLLM
# instance on a different port if you have headroom (or swap it in
# when needed):
docker run --gpus all --rm -d --name vllm-qwen3 \
-p 8001:8000 \
vllm/vllm-openai:v0.17.1 \
--model Qwen/Qwen3-32B-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill
# NOTE: only run this if you have a 5090 (32GB) or larger.
# Two 32B AWQ models do not fit on a 4090.
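With the second instance running, Qwen 3's reasoning mode can be toggled per request. A minimal sketch, assuming the openai Python package and that your vLLM build forwards chat_template_kwargs (and Qwen 3's documented enable_thinking switch) to the chat template — verify against your vLLM and model versions.
# Per-request reasoning toggle for Qwen 3 on the second vLLM instance (port 8001).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="any-string")

def ask(prompt: str, thinking: bool) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-32B-AWQ",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        top_p=0.95,
        # Forwarded to the chat template; controls whether a <think> block is emitted.
        # Assumes this vLLM build supports chat_template_kwargs passthrough.
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
    return resp.choices[0].message.content

# Chat-style query: skip the reasoning block.
print(ask("Summarize what continuous batching does in one sentence.", thinking=False))
# Math query: spend the thinking tokens.
print(ask("Prove that the sum of two odd integers is even.", thinking=True))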
3. Wire Open WebUI as the reasoning-aware frontend
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
ghcr.io/open-webui/open-webui:latest
Open WebUI renders <think> blocks as collapsible sections — the user sees the conclusion first and can expand to inspect the reasoning. This UX pattern is what makes reasoning models actually usable in chat. Without it, the user sees a wall of thinking tokens before the answer.
4. Configure for reasoning-aware sampling
# Reasoning models have specific sampling recommendations:
# - temperature 0.6-0.8 (higher than chat for diverse reasoning paths)
# - top_p 0.95
# - presence_penalty 0 (don't penalize reasoning-token repetition)
# - frequency_penalty 0
# In Open WebUI, set these as workspace defaults for the reasoning
# models. The UI exposes all four sliders.
# For code synthesis tasks specifically, lower temperature to 0.2
# during the answer block while keeping it 0.8 during the reasoning
# block — this is a manual workflow vLLM doesn't support natively;
# Open WebUI's per-message-temperature feature is the workaround.
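If you drive the endpoint from scripts rather than Open WebUI, the same recommendations map directly onto request parameters. A minimal sketch, assuming the step-1 vLLM instance on port 8000; the values mirror the notes above and the prompt is illustrative.
# Reasoning-friendly sampling defaults, applied per request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any-string")

REASONING_SAMPLING = {
    "temperature": 0.7,        # within the 0.6-0.8 band; higher than chat defaults
    "top_p": 0.95,
    "presence_penalty": 0.0,   # don't penalize reasoning-token repetition
    "frequency_penalty": 0.0,
}

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC in five steps."}],
    **REASONING_SAMPLING,
)
print(resp.choices[0].message.content)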
The reasoning-token tax
The cost reasoning models impose on every query, with honest numbers:
- Trivial questions (“what's the capital of France”): reasoning models still emit 100-300 thinking tokens. Pure overhead; use a chat model.
- Standard chat / explanation: 300-800 thinking tokens. Marginal benefit vs chat models; sometimes worth it for the polish.
- Math problems / step-by-step analysis: 500-1500 thinking tokens. Strong benefit vs chat models on correctness.
- Complex code synthesis / architecture decisions: 1000-3000 thinking tokens. Substantial benefit; this is where reasoning models earn their tax.
- Multi-step planning / proof construction: 2000-5000+ thinking tokens. The tier where reasoning models genuinely outperform anything else available locally.
The cost in wall-clock time on RTX 4090: each thinking token costs ~25-30ms (the same throughput as answer tokens). A query with 2000 thinking tokens adds ~50-60 seconds before the actual answer starts streaming. Plan UX accordingly — show progress; never block-wait silently.
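A back-of-the-envelope helper makes the budget concrete. A minimal sketch, assuming the ~30-40 tok/s decode rate cited earlier for 32B AWQ on a 4090 (35 tok/s midpoint).
# Estimate seconds of reasoning emission before the visible answer starts streaming.
def seconds_before_answer(thinking_tokens: int, tok_per_s: float = 35.0) -> float:
    return thinking_tokens / tok_per_s

for n in (300, 800, 2000, 5000):
    print(f"{n:>5} thinking tokens -> ~{seconds_before_answer(n):.0f}s before the answer streams")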
Failure modes you'll hit
- Reasoning blocks leak into structured output. Some clients parse model output as JSON; the <think> block breaks the parse. Strip thinking tokens before structured-output parsing (see the sketch after this list), or instruct the model to skip reasoning when emitting JSON.
- Context-window exhaustion on long reasoning. Complex tasks can emit 5000+ thinking tokens. With 32K context and a 4K input prompt, that leaves ~28K tokens for reasoning + answer. Most queries fit; pathological cases don't. Use a reasoning model with longer context if you hit this regularly.
- OOM on KV cache during reasoning. KV cache scales with output length. Long thinking blocks blow past VRAM budgets sized for chat-class outputs. Set a conservative gpu_memory_utilization (0.85, not 0.9).
- QwQ vs DeepSeek R1 reasoning style mismatch. QwQ's reasoning is shorter and more direct; DeepSeek R1's is more thorough. Switching between them mid-conversation produces inconsistent UX. Pick one; stick with it per workflow.
- Sampler config drift. Reasoning models are more sensitive to sampler parameters than chat models. A temperature of 1.0 (chat default) often produces incoherent reasoning. Use 0.6-0.8.
- Tool-calling format confusion. Reasoning models trained on chain-of-thought sometimes emit reasoning inside tool-call JSON, breaking the parse. Newer reasoning-tuned models handle this; older ones don't. Test with your specific tool-calling client.
- Premature stopping on EOS during reasoning. Some configs treat </think> as a stop token. Verify stop-token list excludes reasoning-block delimiters.
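For the first failure mode above, stripping the reasoning block before parsing is a one-regex fix. A minimal sketch using the <think> delimiter convention this stack assumes; adjust the tags if your model emits different ones.
# Strip <think>...</think> blocks before handing model output to a JSON parser.
import json
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(raw: str) -> str:
    """Return only the final answer, with any reasoning block removed."""
    return THINK_BLOCK.sub("", raw).strip()

raw_output = '<think>The user wants JSON with keys name and count.</think>\n{"name": "demo", "count": 3}'
data = json.loads(strip_reasoning(raw_output))  # parses cleanly once reasoning is gone
print(data["count"])  # -> 3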
Variations and alternatives
Apple Silicon variation. Replace vLLM + 4090 with MLX-LM on an M3 Max or M4 Max with 64GB of unified memory. The unified-memory architecture handles reasoning-block emissions well; long-context throughput stays stable. Pick this when you're Apple-native; expect a ~30% throughput drop.
SGLang variation. Replace vLLM with SGLang if you process many reasoning queries with shared system prompts (batch reasoning workflows). RadixAttention's prefix tree compounds reasoning-mode wins.
Higher-VRAM variation. RTX 5090 32GB or dual-RTX-4090 (TP=2) lets you run two reasoning models simultaneously — DeepSeek R1 Distill for thorough reasoning, QwQ for fast reasoning. Switch per query type.
Cloud-API hybrid. Use full DeepSeek R1 (not Distill) via DeepSeek's API for the hardest tasks; fall back to local Distill for the routine ones. Open WebUI's provider abstraction makes the dual-backend pattern natural.
Who should avoid this stack
- Anyone whose workload is mostly chat or simple tasks. Reasoning-token tax is pure overhead. Use the workstation stack instead.
- Anyone with strict latency budgets. Reasoning models add 50-300% to wall-clock time. If sub-second response is required, chat models or smaller reasoning models (7B-class) are the only viable options.
- Anyone on 16GB VRAM. 32B reasoning models don't fit. Drop to 14B-class reasoning (less capable but still useful) or use API.
- Anyone whose tool-calling client doesn't handle reasoning blocks. Older agent harnesses break on <think> blocks. Verify compatibility before committing.
Going deeper
- DeepSeek R1 Distill Qwen 32B catalog entry — model architecture, reasoning quality benchmarks.
- QwQ 32B catalog entry — the Qwen-team alternative.
- vLLM operational review — the runtime-specific operator detail (KV cache management, chunked prefill).
- Inference runtime ecosystem map — full landscape with the alternatives.
- RTX 4090 workstation stack — the chat-model alternative on the same hardware.