Llama 3.3 70B Instruct
Late-2024 refresh of the 70B Llama line. Roughly matches Llama 3.1 405B on most benchmarks at one-fifth the parameter count. The default high-end model for serious local inference on 48GB+ VRAM rigs.
The new ceiling for "what a serious local-AI hobbyist runs daily." Llama 3.3 70B closed almost the entire gap between open-weight 70B and frontier closed models for everyday work — chat, drafting, code review, RAG, multi-turn tool use. If you have an RTX 3090 / 4090 / 5090 / dual 3090s, this is your headline model.
Strengths
- 70B-class quality at 32B-class effective requirements: Meta clearly continued training with reasoning + tool-use data. Outperforms Llama 3.1 70B noticeably on instruction following.
- Single-card runnable: Q4_K_M at 39 GB requires offload on a 24 GB card, but it works — ~22–28 tok/s with the right runner. On 32 GB+ (RTX 5090, dual cards) it's pure VRAM.
- License unchanged from 3.1: same permissive commercial terms, same broad ecosystem.
Weaknesses
- Single-card VRAM is the bottleneck, not compute: even at Q4 you're partial-offloading on 24 GB. Q5+ is dual-card or workstation territory.
- No native vision: Meta kept multimodality on the 11B/90B vision branch. Use Llama 3.2 90B Vision or Pixtral if you need images.
- Knowledge cutoff is early-2024, which shows on current-events tasks; the local stack should pair it with web search or RAG for time-sensitive work.
Throughput by quantization
- Q4_K_M (39 GB) — partial GPU offload: 22–28 tok/s decode, TTFT 350–500 ms on 1K prompt
- Q5_K_M (47 GB) — heavy CPU offload: 9–14 tok/s, only worth it on dual-card setups
- Q8_0 (70 GB) — workstation only (A6000 / dual 4090 / Mac Studio M2 Ultra)
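Those totals are driven by weight file size plus the KV cache. A rough back-of-envelope sketch, assuming Llama 3.3 70B's commonly cited architecture (80 layers, 8 KV heads of dimension 128, fp16 cache); these architecture numbers are assumptions for illustration, not vendor-confirmed figures:

```shell
#!/bin/sh
# Rough VRAM estimate: quant file size + fp16 KV cache.
# Architecture values below are assumptions for Llama 3.3 70B
# (80 layers, 8 KV heads of dim 128), used only for this sketch.
LAYERS=80
KV_HEADS=8
HEAD_DIM=128
CTX=8192
BYTES_FP16=2

# K and V per token per layer, in bytes (2 tensors: K and V)
per_tok_layer=$((2 * KV_HEADS * HEAD_DIM * BYTES_FP16))   # 4096 B
kv_bytes=$((per_tok_layer * LAYERS * CTX))
kv_gb=$((kv_bytes / 1024 / 1024 / 1024))                  # floors to whole GiB

WEIGHTS_GB=39   # Q4_K_M file size from the list above
echo "KV cache @ ${CTX} ctx: ~${kv_gb} GiB (real value ~2.5)"
echo "Total to fit fully:   ~$((WEIGHTS_GB + kv_gb)) GiB"
```

The cache grows linearly with context, which is why long-context runs push even Q4 past a single 48 GB card.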
Should you run it?
Yes, for anyone who wants a near-frontier general model and is comfortable with 22–28 tok/s. The quality jump from 8B-class is enormous. No, for users on 16 GB or less VRAM (you'll partial-offload onto system RAM and watch tok/s collapse), or for high-throughput agent loops where speed matters more than ceiling quality.
How it compares
- vs Llama 3.1 70B → 3.3 is a meaningful upgrade on instruction following, math, and multi-turn coherence; same VRAM footprint. Always pick 3.3.
- vs Qwen 2.5 72B → Qwen edges Llama on multilingual + raw knowledge; Llama edges Qwen on instruction reliability and tool-use. Both are fair picks; pick by license preference.
- vs DeepSeek R1 Distill Llama 70B → R1 Distill is dramatically better at reasoning tasks (math, code, planning); base Llama 3.3 is better at general chat and writing. Run both side by side if disk allows.
- vs Mixtral 8x22B → Mixtral has sparser compute but heavier total VRAM (~84 GB Q4) and is now behind on quality. Llama 3.3 70B replaces Mixtral 8x22B for almost every workload.
ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M
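Once pulled, Ollama also serves an HTTP API on localhost:11434, which is handy for scripting. A minimal non-streaming request against the standard `/api/generate` endpoint (model tag as above; prompt is just an example):

```shell
#!/bin/sh
# Build a non-streaming generate request for Ollama's local REST API.
# /api/generate and the fields below are the standard Ollama endpoint;
# swap the model tag for whichever quant you pulled.
MODEL="llama3.3:70b-instruct-q4_K_M"
BODY=$(printf '{"model":"%s","prompt":"Say hi in one word.","stream":false}' "$MODEL")
echo "$BODY"
# Against a running Ollama server:
# curl -s http://localhost:11434/api/generate -d "$BODY"
```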
Settings used in the timing range above:
- Quant: Q4_K_M GGUF
- Context: 8192 (--n-gpu-layers 65 of 81)
- Backend: llama.cpp via Ollama, CUDA 12.4
- GPU: RTX 4090, 64 GB DDR5 system RAM
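With 65 of 81 layers on the GPU, roughly that fraction of the Q4_K_M file sits in VRAM and the rest spills to system RAM. A quick sketch of the split, using the file size and layer counts from the settings above (approximate: layers aren't all exactly equal in size, and this ignores the KV cache):

```shell
#!/bin/sh
# Approximate VRAM/system-RAM split for partial offload:
# the offloaded-layer fraction of the GGUF file lands on the GPU.
# Ignores KV cache and assumes roughly equal-sized layers.
FILE_GB=39
GPU_LAYERS=65
TOTAL_LAYERS=81
gpu_gb=$((FILE_GB * GPU_LAYERS / TOTAL_LAYERS))
cpu_gb=$((FILE_GB - gpu_gb))
echo "~${gpu_gb} GB in VRAM, ~${cpu_gb} GB in system RAM"
```

That ~31 GB figure is why a 24 GB card still needs some layers on the CPU even at Q4, and why decode speed lands in the 22–28 tok/s band rather than full-GPU speeds.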
Why this rating
9.1/10 — the best open-weight model you can run on a single 24 GB consumer GPU, and the highest-quality general-purpose local model period for anyone willing to live with 22–28 tok/s. Loses fractional points only because frontier closed models are still ahead on hard reasoning.
Strengths
- Top-tier reasoning at 70B size
- 128K context
- Production-grade tool calling
Weaknesses
- Needs 48 GB+ VRAM to fit fully at Q4; runs on 24 GB only with partial CPU offload at reduced speed
- Slow on memory-bandwidth-bound consumer cards
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required (full GPU, no offload) |
|---|---|---|
| Q4_K_M | 40.0 GB | 48 GB |
| Q5_K_M | 47.0 GB | 56 GB |
| Q8_0 | 70.0 GB | 80 GB |
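A quick way to read the table: compare your card's VRAM against the required column and take the largest quant that fits fully. A sketch with the table's thresholds hard-coded (anything below 48 GB means partial offload, as described in the review above):

```shell
#!/bin/sh
# Pick the largest quant from the table that fits fully in VRAM.
# Thresholds hard-coded from the table above.
pick_quant() {
  vram=$1
  if   [ "$vram" -ge 80 ]; then echo "Q8_0"
  elif [ "$vram" -ge 56 ]; then echo "Q5_K_M"
  elif [ "$vram" -ge 48 ]; then echo "Q4_K_M"
  else echo "none (partial offload needed)"
  fi
}
pick_quant 48   # prints "Q4_K_M"
pick_quant 96   # prints "Q8_0"
pick_quant 24   # prints "none (partial offload needed)"
```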
Get the model
Ollama
One-line install
ollama run llama3.3:70b
Read our Ollama review →
HuggingFace
Original weights
Source repository with the raw weights; you quantize them yourself before local use.
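If you start from the raw weights rather than a prebuilt GGUF, the usual route is huggingface-cli plus llama.cpp's converter and quantizer. A sketch, assuming a llama.cpp checkout, an HF token, and an accepted license on the gated repo; local paths and filenames here are illustrative, not prescribed:

```shell
#!/bin/sh
# Download the gated repo (needs an accepted license + HF token),
# convert to GGUF, then quantize. Paths/filenames are illustrative.
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
  --local-dir ./Llama-3.3-70B-Instruct

# From a llama.cpp checkout:
python convert_hf_to_gguf.py ./Llama-3.3-70B-Instruct \
  --outfile llama-3.3-70b-f16.gguf
./llama-quantize llama-3.3-70b-f16.gguf llama-3.3-70b-q4_k_m.gguf Q4_K_M
```

Expect the intermediate f16 file to be far larger than the final quant (roughly 140 GB vs ~40 GB), so budget disk accordingly.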
Hardware that runs this
Cards with enough VRAM for at least one quantization of Llama 3.3 70B Instruct.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Llama 3.3 70B Instruct?
48 GB to fit Q4_K_M fully on the GPU; 24 GB works with partial CPU offload at reduced speed.
Can I use Llama 3.3 70B Instruct commercially?
Yes. The license is unchanged from Llama 3.1, with the same permissive commercial terms.
What's the context length of Llama 3.3 70B Instruct?
128K tokens.
How do I install Llama 3.3 70B Instruct with Ollama?
ollama pull llama3.3:70b-instruct-q4_K_M, then run the same tag with ollama run.
Source: huggingface.co/meta-llama/Llama-3.3-70B-Instruct
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.