
Qwen 3.6 35B-A3B (MTP)

Qwen 3.6 35B-A3B with Multi-Token Prediction (MTP). The "A3B" suffix means ~3B activated parameters per token via Mixture-of-Experts — inference cost stays mid-tier while total parameter count climbs to 35B. MTP lets the model predict multiple tokens per forward pass, materially speeding up generation throughput on supported runtimes (vLLM 0.20+, recent llama.cpp builds). Currently trending #1 on HuggingFace via unsloth's GGUF quantizations.

License: Apache-2.0 · Released May 11, 2026 · Context: 262,144 tokens

Our verdict

By Fredoline Eruo · Verified May 15, 2026
8.0/10

Positioning

Qwen 3.6 35B-A3B (MTP) is a Mixture-of-Experts (MoE) large language model from Alibaba's Qwen team, released under the permissive Apache-2.0 license. Its architecture activates roughly 3 billion parameters per token while storing 35 billion total, meaning inference cost is closer to a dense 3B-parameter model than a dense 35B one. The addition of Multi-Token Prediction (MTP) allows the model to generate multiple tokens per forward pass, which can materially speed up throughput on compatible runtimes like vLLM 0.20+ and recent llama.cpp builds. With a 262,144-token context window, it targets high-throughput local deployment on workstation-class hardware.

Strengths

  • MoE efficiency with large total capacity. With ~3B activated parameters per token, the model offers the representational capacity of a 35B dense model while keeping per-token compute costs low — ideal for latency-sensitive or high-throughput scenarios.

  • Multi-Token Prediction for faster generation. MTP enables predicting several tokens in a single forward pass, reducing the number of autoregressive steps. This can significantly improve generation throughput on supported inference engines; a toy throughput sketch follows this list.

  • Permissive Apache-2.0 license. Unlike many open-weight models with restrictive licenses, Apache-2.0 allows commercial use, modification, and redistribution without additional fees, making it suitable for enterprise deployment.

  • Large 262K context window. The 262,144-token context enables processing long documents, codebases, or multi-turn conversations without truncation, a key advantage for retrieval-augmented generation and complex reasoning tasks.
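
To make the two efficiency claims concrete, here is a back-of-envelope sketch in Python. The 3B-activated / 35B-total parameter counts come from this page; the tokens-per-pass and draft-acceptance figures are illustrative assumptions, not measured values.

```python
# Toy model of why MoE + MTP cuts per-token cost and step count.
# Parameter counts are from the model card; k and acceptance are
# hypothetical placeholders -- measure them on your own stack.

TOTAL_PARAMS = 35e9    # stored weights (35B total)
ACTIVE_PARAMS = 3e9    # ~3B activated per token via MoE routing

# Per-token compute scales roughly with activated parameters,
# so relative to a dense 35B model:
print(f"~{TOTAL_PARAMS / ACTIVE_PARAMS:.1f}x less compute per token")

# MTP: each forward pass yields 1 guaranteed token plus (k - 1)
# drafted tokens, of which a fraction `acceptance` is kept.
k = 2             # ASSUMPTION: tokens predicted per forward pass
acceptance = 0.8  # ASSUMPTION: fraction of drafted tokens accepted

tokens = 1000
tokens_per_pass = 1 + (k - 1) * acceptance
print(f"{tokens} tokens in ~{tokens / tokens_per_pass:.0f} passes "
      f"instead of {tokens}")
```

On real runtimes the MTP gain depends on how many drafted tokens survive verification, so treat `acceptance` as a knob to measure on your own stack, not a given.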

Limitations

  • High memory requirements at full precision. At FP16, the model requires ~70 GB of disk space, and with KV cache overhead for long contexts, total GPU memory demand can exceed 100 GB — necessitating multi-GPU setups or aggressive quantization.

  • Dependency on MTP-optimized runtimes. The throughput benefits of Multi-Token Prediction are only realized on inference engines that explicitly support it (e.g., vLLM 0.20+, recent llama.cpp). On standard runtimes, MTP may fall back to single-token prediction, negating the advantage.

  • No community-verified benchmark data available. As a newly released model, independent evaluations on standard benchmarks (e.g., reasoning, coding, instruction following) are not yet available. Published vendor metrics should be treated as best-case until confirmed by third parties.

  • Quantization trade-offs at lower bit widths. While Q4_K_M (19.7 GB) and Q2_K (11.4 GB) make the model more accessible, aggressive quantization can degrade output quality, especially for nuanced tasks. Operators should test quantized versions against their use case.

What it takes to run this locally

Quantized model file sizes (GGUF): FP16 ~70 GB, Q8_0 ~37 GB, Q6_K ~28.9 GB, Q5_K_M ~24.9 GB, Q4_K_M ~19.7 GB, Q3_K_M ~17.1 GB, Q2_K ~11.4 GB. Add roughly 30–50% for KV cache and framework overhead at typical context lengths. Deployment class: workstation — a single 48 GB GPU (e.g., RTX 6000 Ada) can run Q4_K_M or Q3_K_M with moderate context; dual 24 GB GPUs (e.g., RTX 4090) can handle Q4_K_M with larger contexts. For full FP16 precision or maximum context, datacenter GPUs (A100 80 GB, H100) are recommended.
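
As a minimal sketch of that sizing rule, the snippet below applies the 30–50% overhead band to the file sizes quoted above; the helper is ours, and real usage varies with context length, batch size, and runtime.

```python
# Rough VRAM sizing from GGUF file size, per the 30-50% rule above.
# The overhead band is a heuristic; actual KV-cache cost depends on
# context length, batch size, and the runtime's allocator.

GGUF_SIZES_GB = {  # file sizes quoted on this page
    "FP16": 70.0, "Q8_0": 37.0, "Q6_K": 28.9, "Q5_K_M": 24.9,
    "Q4_K_M": 19.7, "Q3_K_M": 17.1, "Q2_K": 11.4,
}

def vram_estimate(file_gb: float) -> tuple[float, float]:
    """Return a (low, high) VRAM estimate: file size plus 30-50%."""
    return file_gb * 1.3, file_gb * 1.5

for quant, size in GGUF_SIZES_GB.items():
    lo, hi = vram_estimate(size)
    print(f"{quant:7s} {size:5.1f} GB file -> ~{lo:.0f}-{hi:.0f} GB VRAM")
```

Running it reproduces the deployment-class call above: Q4_K_M lands around 26–30 GB, which is why a single 48 GB card or dual 24 GB cards are the practical floor for comfortable contexts.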

Should you run this locally?

Yes if: you need a permissively licensed MoE model with high throughput potential, have access to workstation-class multi-GPU hardware, and can leverage MTP-optimized runtimes. The model is especially attractive for commercial applications where Apache-2.0 simplifies licensing.

No if: you are limited to a single consumer GPU (12–24 GB VRAM) and need long contexts — a 24 GB card can hold Q4_K_M at moderate context, but the KV cache for anything approaching the 262K window will not fit, and even Q2_K may struggle; or if your inference stack does not support MTP, as the architecture's key advantage goes unused. In those cases, a dense 7B–14B model may be more practical.

Catalog cross-links

  • Qwen family overview
  • GGUF quantization guide
  • vLLM inference engine


How to run it

Recommended runtime: vLLM 0.20+ (best MTP performance) or llama.cpp post-b9148 (best CPU compatibility). For Ollama users, the unsloth/Qwen3.6-35B-A3B-MTP-GGUF Q4_K_M is the practical default — pull it with ollama pull hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:Q4_K_M. VRAM math: the model file is ~20 GB at Q4, and the KV cache adds ~3–5 GB at 16K context with an FP16 KV cache — comfortable on a 24 GB GPU, tight on 16 GB.
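
Once the GGUF is pulled, a quick smoke test through Ollama's OpenAI-compatible endpoint looks roughly like this; it assumes the Ollama server is running on its default port 11434 and the `openai` Python package is installed, with the model tag mirroring the pull command above.

```python
# Smoke-test the model via Ollama's OpenAI-compatible API.
# Assumes `ollama pull hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:Q4_K_M`
# has completed and the Ollama server is running locally.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default endpoint
    api_key="ollama",                      # any non-empty string works
)

resp = client.chat.completions.create(
    model="hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:Q4_K_M",
    messages=[{"role": "user",
               "content": "Summarize Mixture-of-Experts in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

vLLM exposes the same OpenAI-compatible surface via `vllm serve`, so the client code carries over by pointing `base_url` at the vLLM server instead.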

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (qwen-3-6)
• Qwen 3.6 27B (MTP) · 27B · Workstation
• Qwen 3.6 35B-A3B (MTP) · 35B · You are here

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point. The VRAM column below applies the 30–50% overhead rule from the sizing section above.

Quantization   File size   Est. VRAM (+30–50% overhead)
FP16           ~70 GB      ~91–105 GB
Q8_0           ~37 GB      ~48–56 GB
Q6_K           ~28.9 GB    ~38–43 GB
Q5_K_M         ~24.9 GB    ~32–37 GB
Q4_K_M         ~19.7 GB    ~26–30 GB
Q3_K_M         ~17.1 GB    ~22–26 GB
Q2_K           ~11.4 GB    ~15–17 GB

Get the model

• HuggingFace (original weights): huggingface.co/Qwen/Qwen3.6-35B-A3B-MTP · source repository; direct quantization required
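
If you want the original safetensors rather than a prequantized GGUF, a download sketch using the huggingface_hub client follows; the repo id comes from the link above, while the local path is an arbitrary example.

```python
# Fetch the original weights for local quantization.
# Repo id is from this page; local_dir is an arbitrary example path.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="Qwen/Qwen3.6-35B-A3B-MTP",
    local_dir="./qwen3.6-35b-a3b-mtp",  # expect ~70 GB at FP16
)
print(f"weights downloaded to {path}")
```

From there, conversion and quantization tooling (llama.cpp's GGUF conversion scripts, for example) produces the smaller variants listed above.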

Hardware that runs this

Cards with enough VRAM for at least one quantization of Qwen 3.6 35B-A3B (MTP):

• AMD Ryzen AI Max+ 395 (Strix Halo) · amd
• NVIDIA GB200 NVL72 · 13824 GB · nvidia
• AMD Instinct MI355X · 288 GB · amd
• AMD Instinct MI325X · 256 GB · amd
• AMD Instinct MI300X · 192 GB · amd
• NVIDIA B200 · 192 GB · nvidia
• NVIDIA H100 NVL · 188 GB · nvidia
• NVIDIA H200 · 141 GB · nvidia

Frequently asked

Can I use Qwen 3.6 35B-A3B (MTP) commercially?

Yes — Qwen 3.6 35B-A3B (MTP) ships under the Apache-2.0 license, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 3.6 35B-A3B (MTP)?

Qwen 3.6 35B-A3B (MTP) supports a context window of 262,144 tokens (about 262K).

Source: huggingface.co/Qwen/Qwen3.6-35B-A3B-MTP

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier (models in the same parameter band as this one):
• Qwen 3 30B-A3B · qwen · 30B · unrated
• Gemma 4 31B Dense · gemma · 31B · unrated
• Gemma 4 26B MoE · gemma · 26B · unrated
• Nemotron 3 Nano (30B-A3B) · other · 30B · unrated

Step up (more capable — bigger memory footprint):
• Llama 3.1 Nemotron 70B Instruct · llama · 70B · unrated
• Hermes 3 Llama 3.1 70B · hermes · 70B · unrated

Step down (smaller — faster, runs on weaker hardware):
• Granite 3 MoE (3B active) · granite · 16B · unrated
• DeepSeek R1 Distill Mistral 24B · deepseek · 24B · unrated