Aya Expanse 32B
Cohere's multilingual Aya at 32B. Covers 23 languages and was the strongest open-weight multilingual model of late 2024. The closest Apache-2.0 alternative is Qwen 2.5 32B, but Aya has deeper coverage of long-tail languages.
Positioning
Cohere Aya Expanse 32B is the latest in Cohere For AI's multilingual research lineage: a 32-billion-parameter dense model, instruction-tuned for 23 languages with explicit balance across Arabic, Chinese, Japanese, Korean, Turkish, Russian, Spanish, French, German, and 14 others. Released under CC-BY-NC-4.0 (research/non-commercial). The model builds on Cohere's Command-series base with the Aya multilingual pretraining and instruction-tuning recipe, and remains the canonical open-weight 30B-class multilingual model in 2026.
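For a first hands-on test, the standard transformers chat-template flow works. A minimal sketch, assuming transformers ≥ 4.40 (which ships native Cohere support) and accelerate for device placement; the Turkish prompt is purely illustrative:

```python
# Minimal multilingual chat sketch for the released checkpoint.
# Assumes: transformers >= 4.40 (native Cohere support), accelerate installed,
# and enough GPU memory for the precision you load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-32b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The same chat-template flow works for any of the 23 supported languages.
messages = [{"role": "user", "content": "Türkçeye çevir: 'The library opens at nine.'"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.3)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```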
Strengths
- Multilingual coverage is genuinely best-in-class for the parameter tier. Balanced quality across all 23 languages is meaningfully better than Llama 3 or Qwen 3 at the same parameter count, both of which lean English-heavy.
- Strong on under-served languages. Arabic, Korean, Hebrew, Turkish, Vietnamese — languages where Llama 3 lags meaningfully.
- The 32B dense model fits on a single 48 GB GPU at 8-bit (RTX 6000 Ada, L40S) or a single 24 GB card at Q4-Q5 (RTX 4090 / RTX 5090); FP16 weights alone are ~64 GB, so full precision needs an 80 GB-class card or two 48 GB GPUs. See the sizing sketch after this list.
- Instruction-tuning is conservative and predictable. It lacks the heavy RLHF "personality" of Llama 3.x but is reliable for production translation and multilingual chat workflows.
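The back-of-envelope math behind the VRAM claims above; the bits-per-weight values are rough format averages (weights only — KV cache and runtime overhead add a few GB on top):

```python
# Rough weight-memory estimates for a ~32B-parameter dense model.
# Bits-per-weight are approximate format averages, not measured file sizes.
PARAMS = 32e9

for label, bits in [("FP16", 16.0), ("Q8/INT8", 8.0), ("Q5", 5.5), ("Q4", 4.5)]:
    gb = PARAMS * bits / 8 / 1e9  # bytes -> decimal GB
    print(f"{label:8s} ~{gb:4.0f} GB of weights")

# FP16  ~64 GB    -> 80 GB-class GPU (or two 48 GB cards)
# Q8    ~32 GB    -> one 48 GB card (RTX 6000 Ada, L40S)
# Q4-Q5 ~18-22 GB -> one 24 GB card (RTX 3090/4090/5090)
```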
Limitations
- License is non-commercial. CC-BY-NC-4.0 — production commercial deployments require Cohere licensing. Single biggest practical limitation.
- Reasoning is not class-leading. DeepSeek V3 and Qwen 3 dramatically beat Aya on math/code/logic.
- English-only quality is below Llama 3.1 70B / Qwen 3 32B. The multilingual-balanced training trades English performance for cross-language consistency.
- Tool-use / function-calling is basic. Post-trained for chat, not optimized for agentic workflows.
- Long context is not a strength. The model card advertises a 128K window (the 8B sibling is capped at 8K), but quality degrades noticeably beyond 16K.
Real-world performance
- vs Llama 3.1 8B / Llama 3.1 70B: Llama wins for English-only work at either end of the size bracket. Aya Expanse 32B wins clearly on Arabic, Korean, Japanese, and Vietnamese.
- vs Qwen 3 32B: Qwen 3 32B is stronger overall, with better Chinese-English balance. Aya Expanse 32B covers more languages but with less depth per language.
- vs Command R+ 104B: Command R+ is the larger Cohere sibling with retrieval-grounding focus. Aya Expanse 32B is the cheaper-to-serve multilingual chat option.
- vs Google Gemma 2 27B: Comparable parameter tier. Gemma stronger on English; Aya stronger on multilingual.
Should you run this locally?
Yes, if you specifically need 30B-class multilingual chat, your target language mix includes underserved languages (Arabic, Korean, Vietnamese, Hebrew, Turkish), and your deployment is research, academic, or otherwise non-commercial.
No if you need permissive commercial licensing (pick Llama 3.1 70B or Qwen 3 32B), reasoning-heavy workloads (pick DeepSeek/Qwen 3), or English-only workflows (Llama / Qwen win).
How it compares
- vs Aya 23 35B: Aya Expanse is the direct successor, with refined instruction-tuning on the same Command-derived lineage.
- vs Aya 23 8B: the 8B line is the smaller sibling for cheaper inference at a lower capability tier.
- vs Command R 35B: Command R is RAG-tuned; Aya is multilingual-tuned. Different specializations.
Run this yourself
- Single 24 GB GPU at Q4-Q5: RTX 4090, RTX 5090, used 3090.
- Single 48 GB workstation GPU at Q8/INT8: RTX 6000 Ada, L40S.
- Apple Silicon at FP16: Mac Studio M3 Ultra / MacBook Pro M4 Max (96+ GB).
- vLLM serving: `vllm serve CohereForAI/aya-expanse-32b --max-model-len 8192` (a conservative context cap to save KV-cache VRAM; raise it if you have headroom). See the client sketch after this list.
- Cloud rental: Runpod / Lambda L40S ~$1.50-2.50/hr.
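Once the `vllm serve` command above is running, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package; localhost:8000 is vLLM's default bind, and the prompt is purely illustrative:

```python
# Query a local vLLM instance of Aya Expanse 32B via its OpenAI-compatible API.
# Assumes `vllm serve CohereForAI/aya-expanse-32b` is already listening on :8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="CohereForAI/aya-expanse-32b",
    messages=[
        # Cross-language requests are the model's home turf.
        {"role": "user", "content": "Réponds en français : summarize the Aya project in two sentences."},
    ],
    temperature=0.3,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```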
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| AWQ-INT4 | 19.0 GB | 22 GB |
Get the model
- HuggingFace: CohereForAI/aya-expanse-32b — original weights. The source repository ships full-precision weights only, so quantize them yourself or use a community conversion from the Hub.
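A minimal download sketch with huggingface_hub; the repo sits behind a license click-through, so authenticate first (`huggingface-cli login` or a token), and budget roughly 65 GB of disk for the full-precision weights:

```python
# Pull the original Aya Expanse 32B weights from the Hub.
# Assumes you've accepted the CC-BY-NC terms on the model page and are logged in.
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="CohereForAI/aya-expanse-32b")
print("weights downloaded to:", path)
```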
Frequently asked
What's the minimum VRAM to run Aya Expanse 32B?
About 22 GB: the AWQ-INT4 build (19.0 GB file) runs on a single 24 GB card such as an RTX 3090, 4090, or 5090.
Can I use Aya Expanse 32B commercially?
Not under the default CC-BY-NC-4.0 license; commercial deployment requires a separate licensing arrangement with Cohere.
What's the context length of Aya Expanse 32B?
The model card lists 128K, but expect quality degradation beyond 16K; serving configs often cap it lower (the vLLM example above uses 8192).
Source: huggingface.co/CohereForAI/aya-expanse-32b