by Alibaba (Qwen Team)
Alibaba's flagship open-weight family with permissive licensing across most variants. Qwen 3 235B-A22B is the leading open-weight MoE for production reasoning; Qwen 3 32B dense is the strongest 32B-class chat model.
Start with Qwen 3 32B at Q4_K_M via Ollama — it fits on a single RTX 4090 24 GB and scores MMLU 88.5% and GSM8K 94.2%, outperforming Llama 3.1 70B at less than half the VRAM. The 32B is Qwen's efficiency sweet spot: Apache 2.0 license with no MAU cap, strong bilingual English-Chinese performance, and best-in-class math scores (MATH-500 ~82%). If you have limited VRAM (<12 GB), use Qwen 3 8B at Q4 — it runs on a MacBook Pro M4 Max at 22+ tok/s. Skip Qwen 3 235B MoE for a first deployment — the expert-offloading complexity is unnecessary for most workloads, and the 32B dense handles 95% of use cases. Skip Qwen 2.5-Coder unless you specifically need code-first behavior — the base Qwen 3 models' improved code generation makes the dedicated coder variants redundant for general use.
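A quick way to smoke-test the starter recommendation is through the Ollama Python client. This is a minimal sketch, assuming you have already run `ollama pull qwen3:32b`, the Ollama daemon is running locally, and the `ollama` Python package is installed; the model tag is the one used in this guide, not a guarantee of what your registry exposes.

```python
# Minimal local smoke test for Qwen 3 32B via Ollama (illustrative sketch).
# Assumes: `ollama pull qwen3:32b` has completed and the Ollama daemon is running.
import ollama

response = ollama.chat(
    model="qwen3:32b",  # default Ollama tag; Q4_K_M is the typical default quantization
    messages=[
        {"role": "user", "content": "A train covers 120 km in 1.5 hours. What is its average speed?"}
    ],
)
print(response["message"]["content"])
```

If this responds coherently at an acceptable tokens-per-second rate on your card, the heavier serving setups below are optional.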
For single-user local: Ollama with qwen3:32b Q4_K_M on an RTX 4090 24 GB, or MLX-LM on an Apple M3 Ultra. For multi-user serving: vLLM 0.6.5+ with AWQ 4-bit on 2× L40S — Qwen's GQA architecture enables efficient prefix caching at high concurrency. For the MoE frontier: SGLang v0.2.5+ with the DeepSeek/Qwen MoE backend on 4× H100 SXM for Qwen 3 235B-A22B FP8 — ~8,000 tok/s at batch 64. For mobile: llama.cpp with Qwen 3 8B Q4_0 on Snapdragon X Elite — ~15 tok/s. Always keep MoE router weights at FP16. Verify the chat template uses the <|im_start|> (ChatML) format — Llama-format templates silently degrade Qwen's instruction following; see the check below. See the GPU buyer guide.
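One way to sanity-check the chat-template warning is to render a message through the tokenizer that ships with the checkpoint and confirm the ChatML markers appear. A minimal sketch, assuming the Hugging Face `transformers` library is installed and using `Qwen/Qwen3-8B` purely for illustration; substitute whichever Qwen checkpoint you actually deploy.

```python
# Sketch: confirm the chat template renders ChatML (<|im_start|>) turns.
# Assumes the `transformers` library and access to the Hugging Face Hub;
# "Qwen/Qwen3-8B" is illustrative -- use the checkpoint you serve.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "ping"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)
# A correct Qwen template opens each turn with <|im_start|>; Llama-style markers
# such as [INST] ... [/INST] mean the wrong template is being applied.
assert "<|im_start|>" in rendered, "chat template does not look like Qwen's ChatML format"
```

Run the same check against whatever template your serving stack (vLLM, SGLang, llama.cpp) is configured with, since a mismatched template fails silently rather than erroring out.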
Models in this family with our verdicts
Verify Qwen runs on your specific hardware before committing to a purchase.