DeepSeek R1 Distill Llama 70B
Reasoning distillation onto Llama 3.3 70B. Best-in-class open-weight reasoner you can actually fit on a workstation.
The most important practical-local model release of 2025. R1 Distill Llama 70B takes R1's reasoning training and applies it to Llama 3.3 70B — you keep the runs-on-a-4090 footprint and gain dramatic reasoning capability. This is the model RTX 3090 / 4090 / 5090 owners should be running for hard problems.
Strengths
- Frontier-adjacent reasoning in a 70B-class footprint that runs locally.
- Same Llama 3.3 70B VRAM at Q4 — no new hardware needed if you already run Llama 3.3 70B.
- Llama license carries through — same permissive commercial terms.
Weaknesses
- Verbose chain-of-thought — 2–3× token cost vs base Llama 3.3 70B.
- Generalist quality slightly below base Llama 3.3 70B on simple tasks — pure reasoning training has a small everyday-chat tax.
- Same partial-offload speeds as Llama 3.3 70B on 24 GB cards (22–28 tok/s), so the extra tokens translate directly into longer waits per answer.
- Q4_K_M (39 GB) — partial offload: 21–27 tok/s decode, but 2–3× tokens per answer
- Q5_K_M (47 GB) — heavy offload: 9–13 tok/s
- Q8_0 (70 GB) — workstation only
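If you drive the GGUF directly with llama.cpp rather than Ollama, partial offload looks like the sketch below. The filename is illustrative, and the layer split mirrors the RTX 4090 configuration used for this review; tune -ngl to your card's free VRAM:

```bash
# Serve the Q4_K_M build with partial GPU offload via llama.cpp's llama-server.
# 65 of 81 layers on the GPU matches the RTX 4090 + 64 GB RAM setup reviewed here.
# DeepSeek recommends a sampling temperature around 0.5-0.7 for the R1 distills.
./llama-server \
  -m ./DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf \
  -c 16384 \
  -ngl 65 \
  --temp 0.6 \
  --port 8080
```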
Should you run it?
Yes, for anyone with a single 24 GB card who wants near-frontier reasoning; this is the most consequential reasoning-model decision in local AI right now. No, for users on smaller cards (use R1 Distill Qwen 14B instead) or for general chat, where base Llama 3.3 70B is faster and equally capable on simple prompts.
How it compares
- vs Llama 3.3 70B (base) → R1 Distill wins decisively on reasoning; base wins on simple-task throughput. Run both side by side if disk allows (see the snippet after this list).
- vs DeepSeek R1 (full) → full R1 has a higher ceiling but requires a workstation; Distill 70B captures ~80% of the lift on consumer hardware.
- vs QwQ 32B → R1 Distill 70B wins decisively on reasoning quality; QwQ 32B wins on speed (full GPU offload on 24 GB).
- vs DeepSeek R1 Distill Qwen 32B → the 70B Llama distill is smarter; the 32B Qwen distill is faster (full GPU offload). Pick by speed-vs-quality.
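To run that side-by-side comparison, pulling both models under Ollama is enough. A minimal sketch, assuming the standard Ollama library tags for both models:

```bash
# Pull both models so you can A/B them on the same prompts.
ollama pull llama3.3:70b     # base Llama 3.3 70B
ollama pull deepseek-r1:70b  # R1 Distill Llama 70B (defaults to Q4_K_M)
```

Route hard reasoning prompts to the distill and everyday chat to the base model; both fit the same 24 GB partial-offload setup, just not simultaneously.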
ollama pull deepseek-r1:70b-llama-distill-q4_K_M
ollama run deepseek-r1:70b-llama-distill-q4_K_M
Settings: Q4_K_M GGUF, 16,384-token context, --n-gpu-layers 65 (of 81), RTX 4090 + 64 GB system RAM
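With the model pulled, you can exercise it through Ollama's local HTTP API; the distill emits its chain of thought inside <think>…</think> tags before the final answer, which is where the 2–3× token overhead shows up. A minimal sketch against Ollama's standard /api/generate endpoint (the prompt is just an example):

```bash
# Query the model via Ollama's local API (default port 11434).
# Expect a <think>...</think> reasoning block before the final answer.
curl -s http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:70b",
  "prompt": "Prove that the sum of two odd integers is even.",
  "stream": false
}'
```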
Why this rating
9.0/10 — the headline practical-local model of the R1 release. Distills R1's reasoning into the Llama 3.3 70B body — runs in 39 GB Q4, same hardware as Llama 3.3 70B, but with reasoning quality that genuinely approaches frontier levels. The right pick for "I want o1-class reasoning on a single 24 GB card."
Strengths
- Llama 3.3 community license carries through (permissive commercial use)
- Top reasoning quality in the 70B class
- Full-GPU Q4 fits on dual 24 GB cards
Weaknesses
- Slower than non-reasoning 70B
Quantization variants
Each quantization trades model quality for a smaller file size and VRAM footprint. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 40.0 GB | 48 GB |
| Q5_K_M | 47.0 GB | 56 GB |
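Before choosing a row, check what your GPU actually reports; on NVIDIA cards a one-liner does it:

```bash
# Report per-GPU total and free VRAM; compare against the table above.
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```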
Get the model
Ollama
One-line install
ollama run deepseek-r1:70b
Read our Ollama review →
HuggingFace
Original weights
Source repository with the original safetensors weights; you'll need to convert and quantize them yourself for local use.
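If you need a quant that isn't prepackaged, the usual route is to pull the safetensors weights and convert them with llama.cpp's tooling. A sketch assuming a local, built llama.cpp checkout with its Python requirements installed; paths and filenames are illustrative:

```bash
# Fetch the original weights (~140 GB in BF16) from the source repository.
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --local-dir r1-distill-llama-70b

# Convert to a 16-bit GGUF, then quantize down to Q4_K_M.
python llama.cpp/convert_hf_to_gguf.py r1-distill-llama-70b \
  --outfile r1-distill-llama-70b-f16.gguf
llama.cpp/build/bin/llama-quantize \
  r1-distill-llama-70b-f16.gguf r1-distill-llama-70b-Q4_K_M.gguf Q4_K_M
```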
Hardware that runs this
Cards with enough VRAM for at least one quantization of DeepSeek R1 Distill Llama 70B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run DeepSeek R1 Distill Llama 70B?
48 GB for full-GPU Q4_K_M (e.g. dual 24 GB cards); a single 24 GB card works with partial offload at roughly 21–27 tok/s decode.
Can I use DeepSeek R1 Distill Llama 70B commercially?
Yes. The Llama 3.3 community license carries through, with the same permissive commercial terms as the base model.
What's the context length of DeepSeek R1 Distill Llama 70B?
128K tokens (131,072), inherited from Llama 3.3 70B.
How do I install DeepSeek R1 Distill Llama 70B with Ollama?
Run ollama run deepseek-r1:70b, which pulls the default Q4_K_M build.
Source: huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.