Mixtral 8x7B Instruct

The MoE model that introduced the 8-expert pattern to the open-weight world. 47B params total, 13B active. Still a viable workhorse on 36 GB+ setups.

License: Apache 2.0 · Released Dec 11, 2023 · Context: 32,768 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
6.4/10
Positioning

The first practical MoE model in local AI. Today it sits in an awkward middle: routing activates only 13B parameters per token (fast for its size), but all 47B still have to fit in memory (26 GB at Q4_K_M). Meanwhile Llama 3.3 70B at Q4, which beats it on quality, runs in similar memory with offload.

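The memory-versus-compute mismatch is easy to put in numbers. A back-of-envelope sketch in Python, assuming an effective ~4.5 bits per weight for Q4_K_M (an approximation; real GGUF files mix quant types across tensors):

# MoE memory vs. compute for Mixtral 8x7B (rough estimate).
TOTAL_PARAMS = 46.7e9    # all eight experts plus shared layers
ACTIVE_PARAMS = 12.9e9   # two experts routed per token
BITS_PER_WEIGHT = 4.5    # assumed Q4_K_M effective average

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"Weights held in memory: {weights_gb:.1f} GB")  # ~26 GB
print(f"Compute per token: {ACTIVE_PARAMS / 1e9:.0f}B of {TOTAL_PARAMS / 1e9:.0f}B params")

You pay memory for all 47B parameters but get per-token compute from only 13B, which is exactly the tradeoff described under Limitations below.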
Strengths
  • Active-parameter speed: 28–35 tok/s on a 4090 at Q4 (with offload), notably faster than dense 47B equivalents.
  • Apache 2.0 license — clean commercial story.
  • Strong multilingual for its era; French and German specifically remain solid.
Limitations
  • VRAM-heavy for the active compute: you pay 26 GB to use 13B worth of compute per token. Bad memory-vs-quality tradeoff today.
  • Routing instability at long contexts — output quality degrades noticeably past 16K.
  • Beaten by Llama 3.3 70B on almost every general benchmark while needing similar memory.
Real-world performance on RTX 4090
  • Q4_K_M (26 GB) — partial offload on 24 GB: 28–35 tok/s decode, TTFT ~250 ms on 1K prompt
  • Q5_K_M (33 GB) — heavy offload: 14–20 tok/s
  • Q8_0 (47 GB) — workstation territory only
Should you run this locally?

Yes, for legacy fine-tunes you depend on, or where the Apache 2.0 license is required and you need MoE speed characteristics. No, for new deployments — Llama 3.3 70B at Q4_K_M lives in similar memory and produces meaningfully better outputs.

How it compares
  • vs Llama 3.3 70B Q4 → similar VRAM footprint, Llama 3.3 wins on quality across general tasks. The MoE speed advantage is real (~25% faster) but the quality gap is larger.
  • vs Mixtral 8x22B → 8x22B is the modern Mixtral pick; uses ~3× the VRAM but earns it on quality.
  • vs Qwen 3 30B-A3B (MoE) → Qwen 3 30B-A3B does what Mixtral 8x7B promised: smaller VRAM (~17 GB Q4), tighter routing, better quality. Pick Qwen 3 30B-A3B if you want MoE speed today.
Run this yourself
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M
Settings: Q4_K_M GGUF, 8192 ctx, 24 of 33 layers on GPU (llama.cpp --n-gpu-layers 24; the equivalent Ollama option is num_gpu), CUDA 12.4 on a 4090
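
The same settings can be applied per request if you drive Ollama programmatically. A minimal sketch against Ollama's local HTTP API; the option names (num_ctx, num_gpu) are Ollama's, and the prompt is just a placeholder:

import requests

# Generate with 8K context and 24 layers offloaded to the GPU,
# matching the CLI settings above (Ollama listens on 11434 by default).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q4_K_M",
        "prompt": "Summarize the tradeoffs of mixture-of-experts models.",
        "stream": False,
        "options": {"num_ctx": 8192, "num_gpu": 24},
    },
    timeout=600,
)
print(resp.json()["response"])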
Why this rating

6.4/10. The original sparse-MoE story mattered for local AI, but the math no longer pencils out: Llama 3.3 70B uses similar VRAM and is materially better, and Mixtral 8x22B is the more credible MoE option for demanding workloads.

Strengths

  • Apache 2.0
  • Pioneer MoE
  • Wide ecosystem support

Weaknesses

  • Now outpaced by Qwen 3 30B-A3B

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          28.0 GB      32 GB
Q5_K_M          33.0 GB      38 GB
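
Scripted hardware checks reduce that table to a lookup. A hypothetical helper (the figures are the table's own; the function name is invented for illustration):

def pick_quant(vram_gb: float) -> str | None:
    """Return the highest-quality quantization from the table that fits."""
    table = [("Q5_K_M", 38), ("Q4_K_M", 32)]  # (quant, VRAM required in GB)
    for quant, needed in table:
        if vram_gb >= needed:
            return quant
    return None  # nothing fits fully on-GPU; consider offload or a smaller model

print(pick_quant(36))  # -> Q4_K_M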

Get the model

Ollama

One-line install

ollama run mixtral:8x7b
Read our Ollama review →

HuggingFace

Original weights

huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

Source repository with the original weights; you'll need to quantize them yourself (e.g., to GGUF) for local use.

Benchmarks

Real measurements on real hardware. Numbers ship with the runner version, quant, and date.

1 run on record
Hardware: NVIDIA GeForce RTX 4090 (Ollama)
Quant: Q4_K_M
Context: 8K
Tokens/sec: 31.4
VRAM: 23.1 GB
TTFT: 248 ms
Date: Apr 23, 2026
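
To produce a row like this yourself, Ollama's generate endpoint returns timing counters that convert directly into these columns. A rough sketch; the field names (eval_count, eval_duration, prompt_eval_duration) come from Ollama's API, durations are reported in nanoseconds, and prompt processing time is only a proxy for true TTFT:

import requests

# One non-streamed generation; the response carries timing counters.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral:8x7b-instruct-v0.1-q4_K_M",
          "prompt": "Explain KV caching in two sentences.",
          "stream": False},
    timeout=600,
).json()

decode_tps = r["eval_count"] / (r["eval_duration"] / 1e9)  # decode tokens/sec
ttft_ms = r["prompt_eval_duration"] / 1e6                  # prompt eval as TTFT proxy
print(f"{decode_tps:.1f} tok/s, TTFT ~{ttft_ms:.0f} ms")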

Hardware that runs this

Cards with enough VRAM for at least one quantization of Mixtral 8x7B Instruct.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Mixtral 8x7B Instruct?

32GB of VRAM is enough to run Mixtral 8x7B Instruct at the Q4_K_M quantization (file size 28.0 GB). Higher-quality quantizations need more.

Can I use Mixtral 8x7B Instruct commercially?

Yes — Mixtral 8x7B Instruct ships under the Apache 2.0 license, which permits commercial use. Always read the license text before deployment.

What's the context length of Mixtral 8x7B Instruct?

Mixtral 8x7B Instruct supports a context window of 32,768 tokens (32K).

How do I install Mixtral 8x7B Instruct with Ollama?

Run `ollama pull mixtral:8x7b` to download, then `ollama run mixtral:8x7b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.