qwen · 72B parameters · Commercial OK

Qwen 2.5 72B Instruct

The flagship of Qwen 2.5. Workstation-tier: full-GPU inference needs 48 GB+ VRAM, though Q4 quantizations run on a 24 GB card with partial offload.

License: Qwen License · Released Sep 19, 2024 · Context: 131,072 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
9.0/10
Positioning

The Qwen flagship in the dense-72B class. If you have a 4090 / 5090 / RTX 6000 Ada or are on Apple Silicon with 64 GB+ unified memory, this competes with Llama 3.3 70B as the best general open-weight model available.

Strengths
  • Highest multilingual ceiling in the open 70B class — Chinese, Korean, Japanese, and German are all near-frontier quality.
  • Long-context behavior holds up well out to 64K in practice.
  • Math and code are strong — better than Llama 3.1 70B base; close to Llama 3.3 70B.
Limitations
  • Same VRAM constraints as Llama 3.3 70B — Q4 partial-offload on 24 GB.
  • License caps at 100M MAU — review for scale deployments.
  • Refusal behavior on geopolitical content can be limiting depending on use case.
Real-world performance on RTX 4090
  • Q4_K_M (41 GB) — partial offload: 21–27 tok/s decode, TTFT ~400 ms
  • Q5_K_M (47 GB) — heavier offload: 9–13 tok/s
  • Q8_0 (72 GB) — workstation only
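
To reproduce the Q4_K_M partial-offload numbers with llama.cpp directly, something like the following works — a sketch, assuming a local GGUF file (the filename is illustrative) and the 60-of-81 layer split listed under "Run this yourself":

# serve the Q4_K_M GGUF with 60 layers on GPU, rest on CPU
llama-server -m Qwen2.5-72B-Instruct-Q4_K_M.gguf -c 8192 -ngl 60 --port 8080

Raising -ngl beyond what fits in 24 GB will fail at load; lower it if you hit out-of-memory errors.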
Should you run this locally?

Yes, for users who want the best multilingual local model and have the same hardware that runs Llama 3.3 70B. No, for English-only workloads where Llama 3.3 70B's instruction-following polish is preferable.

How it compares
  • vs Llama 3.3 70B → coin flip on English; Qwen wins decisively on non-English. Pick by language mix.
  • vs Llama 3.1 70B → Qwen 2.5 72B wins outright; Llama 3.1 70B is the previous-generation comparison.
  • vs Qwen 2.5 32B → 72B is meaningfully smarter on hard tasks; 32B is faster and full-GPU. Pick by speed-vs-quality preference.
  • vs DeepSeek R1 Distill Llama 70B → R1 Distill is dramatically better at reasoning; Qwen 2.5 72B wins at general chat and writing.
Run this yourself
ollama pull qwen2.5:72b-instruct-q4_K_M
ollama run qwen2.5:72b-instruct-q4_K_M
Settings: Q4_K_M GGUF · 8,192-token context · --n-gpu-layers 60 of 81 · RTX 4090
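
If you'd rather bake these settings in than pass them per run, an Ollama Modelfile can pin them. A minimal sketch — num_ctx and num_gpu are standard Modelfile parameters; the name qwen72b-local is arbitrary:

# Modelfile — pin context length and GPU layer count
FROM qwen2.5:72b-instruct-q4_K_M
PARAMETER num_ctx 8192
PARAMETER num_gpu 60

Then build and run it:

ollama create qwen72b-local -f Modelfile
ollama run qwen72b-local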
Why this rating

9.0/10 — neck-and-neck with Llama 3.3 70B for "best general open-weight model that runs on a single 24 GB card with offload." Wins on multilingual, loses on instruction polish.

Overview

Qwen 2.5's dense 72B flagship and the strongest open-weight multilingual model in its class. Plan on 48 GB+ VRAM for full-GPU inference, or a 24 GB card with partial CPU offload at Q4.

Strengths

  • Top-tier open weights in the 72B class
  • Strong multilingual performance

Weaknesses

  • License caps commercial use at 100M monthly active users
  • Needs 48 GB+ VRAM for full-GPU inference

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          41.0 GB      48 GB
Q5_K_M          49.0 GB      56 GB
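
Ollama exposes per-quant tags following one naming pattern, so pulling a specific variant from the table is one line. The q5_K_M tag below mirrors the q4_K_M tag used earlier — confirm it exists on the Ollama library page before relying on it:

ollama pull qwen2.5:72b-instruct-q5_K_M
ollama show qwen2.5:72b-instruct-q5_K_M

ollama show prints the parameter count, quantization, and context length, so you can verify what you actually downloaded.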

Get the model

Ollama

One-line install

ollama run qwen2.5:72b

HuggingFace

Original weights

huggingface.co/Qwen/Qwen2.5-72B-Instruct

Source repository with the original FP16 weights — quantize them yourself (e.g., to GGUF) for local inference.
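
If you do quantize from source, the usual llama.cpp route looks like this — a sketch assuming a llama.cpp checkout and roughly 150 GB of free disk for the intermediate FP16 file; the script and binary names match current llama.cpp but have changed across versions:

huggingface-cli download Qwen/Qwen2.5-72B-Instruct --local-dir Qwen2.5-72B-Instruct
python convert_hf_to_gguf.py Qwen2.5-72B-Instruct --outfile qwen2.5-72b-f16.gguf
./llama-quantize qwen2.5-72b-f16.gguf qwen2.5-72b-Q4_K_M.gguf Q4_K_M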

Hardware that runs this

Cards with enough VRAM for at least one quantization of Qwen 2.5 72B Instruct.

Compare alternatives

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Qwen 2.5 72B Instruct?

48 GB of VRAM is enough to run Qwen 2.5 72B Instruct at the Q4_K_M quantization (file size 41.0 GB). Higher-quality quantizations need more.

Can I use Qwen 2.5 72B Instruct commercially?

Yes — Qwen 2.5 72B Instruct ships under the Qwen License, which permits commercial use up to a 100M monthly-active-user cap. Always read the license text before deployment.

What's the context length of Qwen 2.5 72B Instruct?

Qwen 2.5 72B Instruct supports a context window of 131,072 tokens (128K).
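
Note that Ollama typically defaults to a much shorter window than the model's maximum; one way to request more per call is the options.num_ctx field of its HTTP API. A sketch — 32768 is an arbitrary example, and KV-cache memory grows with the window:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:72b",
  "prompt": "Summarize the following document...",
  "options": { "num_ctx": 32768 }
}'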

How do I install Qwen 2.5 72B Instruct with Ollama?

Run `ollama pull qwen2.5:72b` to download, then `ollama run qwen2.5:72b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/Qwen/Qwen2.5-72B-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.