52B parameters · Commercial OK · Reviewed May 2026

Jamba 1.5 Mini

AI21's hybrid Mamba-Transformer MoE. 256k context with the SSM throughput advantage.

License: Jamba Open Model License · Released Aug 22, 2024 · Context: 262,144 tokens


How to run it

Jamba 1.5 Mini is AI21's smaller SSM-hybrid model (52B total parameters, ~12B active via MoE). The SSM backbone enables efficient long-context handling. Run it at Q4_K_M via Ollama (ollama pull jamba:1.5-mini, if the tag exists in the catalog; see "What breaks first" below) or via llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is roughly 25-30 GB on disk.

Minimum VRAM is 16 GB: an RTX 4080 (16GB) handles Q4_K_M with KV offload at 8K context, and an RTX 4090 (24GB) runs Q4_K_M comfortably at 16K. The recommended setup is a single RTX 4090 24GB at Q4_K_M, with throughput around 30-50 tok/s.

The SSM architecture keeps KV-cache growth low, so 32K+ context is practical on 24 GB, and the ~12B active subset keeps generation efficient. The tradeoff: SSM layers decode sequentially, so peak tok/s is slightly lower than a pure-attention model of the same active size, in exchange for the context efficiency. Jamba 1.5 Mini is the most accessible SSM-hybrid model and is consumer-GPU friendly. For larger SSM models, see Jamba 1.5 Large.
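A minimal llama.cpp invocation, as a sketch: the GGUF file name below is a placeholder for whatever quant you actually have on disk, and the flags are the ones this page already recommends.

  # interactive chat, all layers offloaded, flash attention, 16K context
  llama-cli -m ./jamba-1.5-mini-Q4_K_M.gguf -ngl 999 -fa -c 16384 -cnv

  # or serve an OpenAI-compatible API on localhost:8080
  llama-server -m ./jamba-1.5-mini-Q4_K_M.gguf -ngl 999 -fa -c 16384

If generation is slower than expected, check the load log: llama.cpp prints how many layers actually landed on the GPU.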

Hardware guidance

Minimum: RTX 3060 12GB at Q3_K_M with KV offload. Recommended: RTX 4090 24GB at Q4_K_M.

VRAM math: ~52B total, ~12B active. Q4_K_M ≈ 25-30 GB for full weights. Expert offload reduces VRAM to ~8-12 GB (active experts only). SSM KV cache: ~2-5 GB at 32K context, significantly less than attention models. Total at Q4 with offload: ~15-20 GB for 32K context, comfortable on 24 GB cards.

Per-card notes: RTX 3090 24GB runs Q4_K_M with expert offload at 32K context. RTX 4080 16GB runs Q4_K_M with expert offload at 8-16K. A MacBook Pro M4 Pro 24GB+ manages Q4_K_M at 8-12 tok/s. In the cloud, an A10 24GB handles Q4_K_M.

The SSM kernel requires CUDA 11.8+ and SM 7.5+ (Turing or newer); Pascal GPUs are not supported.
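The arithmetic behind those numbers, as a rough sketch. Q4_K_M averages somewhere near 4.8 bits per parameter (the exact figure varies with the layer mix), so:

  52B params × ~4.8 bits ÷ 8 bits/byte ≈ 31 GB   (full weights; matches the 25-30 GB file)
  12B active × ~4.8 bits ÷ 8 bits/byte ≈ 7 GB    (GPU-resident weights under expert offload)
  SSM state + KV cache at 32K context   ≈ 2-5 GB

That lands at roughly 9-12 GB before runtime overhead; the ~15-20 GB figure above adds compute buffers and the shared non-expert layers that stay resident. It is also why the quantization table below quotes 36 GB for fully resident Q4_K_M while 24 GB cards still work with offload.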

What breaks first

  1. SSM kernel on older GPUs. Mamba kernels require Turing (SM 7.5) or newer. GTX 10-series and older won't run Jamba. Check your CUDA compute capability (see the one-liner after this list).
  2. Ollama Jamba support. Jamba's SSM-hybrid architecture may not be in Ollama's default catalog. Verify with ollama list or use raw llama.cpp.
  3. Per-token speed ceiling. SSM decode is sequential, so tok/s is lower than attention at the same active parameter count. Jamba 1.5 Mini trades peak speed for context efficiency.
  4. Expert offload latency. When experts sit in system RAM, routing to a RAM-resident expert causes 30-80ms stalls. On consumer machines with slow DDR4 the stall is noticeable; fast DDR5 minimizes the penalty.
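The compute-capability check from item 1, as a one-liner. The compute_cap query field needs a reasonably recent NVIDIA driver; on older drivers, look the value up from the GPU model name instead.

  nvidia-smi --query-gpu=name,compute_cap --format=csv
  # example output: NVIDIA GeForce RTX 4090, 8.9
  # 7.5 or higher runs the Mamba kernels; Pascal reports 6.x and will not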

Runtime recommendation

llama.cpp with -ngl 999 is the primary option — most mature Jamba/SSM support. Ollama for quick-start if Jamba tag exists. vLLM for serving (verify Jamba SSM support). Avoid MLX-LM — Apple Silicon SSM kernel is less optimized.
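For the vLLM path, a serving sketch under two assumptions: your vLLM build lists Jamba among its supported architectures, and you are serving the original unquantized weights, which at bf16 come to roughly 104 GB (52B × 2 bytes). That makes this a multi-GPU or datacenter-card route rather than a consumer one.

  vllm serve ai21labs/AI21-Jamba-1.5-Mini --max-model-len 32768 --tensor-parallel-size 2

--tensor-parallel-size splits the weights across GPUs; size it so total VRAM comfortably exceeds weights plus cache.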

Common beginner mistakes

  • Mistake: running Jamba on GTX 1080-class GPUs. Fix: SSM kernels require Turing+ (SM 7.5). Pascal GPUs will crash or produce undefined behavior.
  • Mistake: expecting 100+ tok/s because active params are only 12B. Fix: SSM decode is sequential. 30-50 tok/s at Q4 on an RTX 4090 is realistic, not 100+.
  • Mistake: setting 256K context and expecting it to work on 24 GB. Fix: SSM is efficient, but 256K is extreme. Start at 32K, benchmark VRAM, scale up (see the sketch after this list).
  • Mistake: using Q8 because "the file size is small." Fix: Q8 is ~50 GB, twice the size of Q4_K_M. Stick to Q4_K_M on consumer hardware; Q8 gains are marginal on SSM models.
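One way to run that benchmark-then-scale loop, as a sketch. The GGUF name is a placeholder; watch is standard on Linux, or just rerun nvidia-smi by hand.

  # load the model at the target context in one terminal...
  llama-server -m ./jamba-1.5-mini-Q4_K_M.gguf -ngl 999 -fa -c 32768
  # ...and watch headroom in another before stepping up to 65536
  watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv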

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent/child edges record direct distillation or fine-tune relationships.

Family siblings (jamba-1.5)
  • Jamba 1.5 Mini · 52B (you are here)
  • Jamba 1.5 Large · 398B · Frontier

Distilled / fine-tuned from this
  • Jamba 1.5 Large · 398B · Frontier

Strengths

  • 256k context
  • Hybrid SSM-Transformer
  • Long-context throughput

Weaknesses

  • Limited runtime support outside vLLM

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization · File size · VRAM required
Q4_K_M · 30.0 GB · 36 GB
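If a community GGUF exists, pulling only the Q4_K_M files looks like this. The repo name is hypothetical, so search Hugging Face for a Jamba 1.5 Mini GGUF upload first.

  huggingface-cli download <uploader>/Jamba-1.5-Mini-GGUF --include "*Q4_K_M*" --local-dir ./models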

Get the model

HuggingFace (original weights)

huggingface.co/ai21labs/AI21-Jamba-1.5-Mini

This is the source repository only; no prequantized files are published there, so you quantize the weights yourself.
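The quantize-it-yourself path with llama.cpp's tooling, as a sketch. It assumes a llama.cpp build recent enough for its converter to know Jamba's architecture, and note the f16 intermediate costs roughly 104 GB of disk (52B × 2 bytes).

  # fetch the original safetensors
  huggingface-cli download ai21labs/AI21-Jamba-1.5-Mini --local-dir ./jamba-mini
  # convert to an f16 GGUF, then quantize down to Q4_K_M
  python convert_hf_to_gguf.py ./jamba-mini --outfile jamba-1.5-mini-f16.gguf
  ./llama-quantize jamba-1.5-mini-f16.gguf jamba-1.5-mini-Q4_K_M.gguf Q4_K_M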

Hardware that runs this

Cards with enough VRAM for at least one quantization of Jamba 1.5 Mini.

  • NVIDIA GB200 NVL72 · 13,824 GB
  • AMD Instinct MI355X · 288 GB
  • AMD Instinct MI325X · 256 GB
  • AMD Instinct MI300X · 192 GB
  • NVIDIA B200 · 192 GB
  • NVIDIA H100 NVL · 188 GB
  • NVIDIA H200 · 141 GB
  • AMD Instinct MI250X · 128 GB

Frequently asked

What's the minimum VRAM to run Jamba 1.5 Mini?

36GB of VRAM runs Jamba 1.5 Mini at the Q4_K_M quantization (file size 30.0 GB) with the full weights resident on the GPU; higher-quality quantizations need more. With expert offload, the Hardware guidance above fits the same quant onto 24 GB cards at some speed cost.

Can I use Jamba 1.5 Mini commercially?

Yes — Jamba 1.5 Mini ships under the Jamba Open Model License, which permits commercial use. Always read the license text before deployment.

What's the context length of Jamba 1.5 Mini?

Jamba 1.5 Mini supports a context window of 262,144 tokens (256K).

Source: huggingface.co/ai21labs/AI21-Jamba-1.5-Mini

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.


Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier
Models in the same parameter band as this one
  • Llama 3.3 70B Instruct · llama · 70B · 9.1/10
  • DeepSeek R1 Distill Llama 70B · deepseek · 70B · 9.0/10
  • Qwen 2.5 72B Instruct · qwen · 72B · 9.0/10
  • Llama 3.1 70B Instruct · llama · 70B · 8.0/10

Step up
More capable, with a bigger memory footprint
  • DeepSeek V4 Pro (1.6T MoE) · deepseek · 1600B · unrated
  • Qwen 3.5 235B-A17B (MoE) · qwen · 397B · unrated

Step down
Smaller and faster; runs on weaker hardware
  • Qwen 3 30B-A3B · qwen · 30B · unrated
  • Gemma 4 31B Dense · gemma · 31B · unrated