72B parameters · Commercial OK · Multimodal · Reviewed May 2026

Molmo 72B

Ai2's Molmo flagship: an Apache 2.0 VLM that rivals proprietary models on UI pointing and visual reasoning.

License: Apache 2.0 · Released Sep 25, 2024 · Context: 4,096 tokens

Overview

Molmo 72B is the flagship of Ai2's Molmo family: an Apache 2.0 vision-language model that rivals proprietary models on UI pointing and visual reasoning. Its standout capability is pixel-precise pointing, which lets it reference specific regions of an image by coordinates.

How to run it

Molmo 72B is Ai2's vision-language model: a 72B dense backbone paired with a custom vision encoder, designed for strong visual understanding with a focus on pointing and grounding (it can reference specific image regions by coordinates). That pixel-precise pointing is Molmo's unique feature, and it makes the model useful for UI automation, grounded visual QA, and robotics.

Run it at Q4_K_M via llama.cpp with its llava-style server for vision. Expect a ~41 GB file for the text weights plus ~3-5 GB for the vision components. Minimum VRAM is 48 GB (an RTX A6000 at Q3_K_M with vision); recommended is an A100 80GB at AWQ-INT4. Throughput is roughly 12-20 tok/s on an A6000 at Q4_K_M text-only, with image encoding adding 1-3 s per request.

Ai2's license is permissive (Apache 2.0), but ecosystem support is narrower than for Llama or Qwen vision models: verify llama.cpp's Molmo support before committing, expect Ollama to lack a Molmo build (use raw llama.cpp instead), and for serving, use vLLM only if Molmo is registered as a supported architecture.
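If your vLLM build does register Molmo as a supported architecture, offline inference looks roughly like the sketch below. This is a hedged sketch, not confirmed Molmo-specific usage: the prompt string and the multi_modal_data key follow vLLM's generic vision-model API, and the exact Molmo chat template should be checked against vLLM's model docs.

```python
# Sketch: assumes a vLLM version that lists Molmo among supported architectures.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="allenai/Molmo-72B-0924",
    trust_remote_code=True,   # Molmo ships custom modeling code
    max_model_len=4096,       # matches the model's 4,096-token context
)

out = llm.generate(
    {
        # The exact prompt/chat template is an assumption; verify for Molmo.
        "prompt": "Point to the Submit button.",
        "multi_modal_data": {"image": Image.open("screenshot.png")},
    },
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(out[0].outputs[0].text)
```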

Hardware guidance

Minimum: RTX A6000 48GB at Q3_K_M + vision (4K context). Recommended: A100 80GB at AWQ-INT4.

VRAM math: 72B dense at Q4_K_M ≈ 41 GB, Molmo vision encoder ≈ 3-5 GB, KV cache at 8K ≈ 10 GB, for a total of ~54-56 GB.

  • A6000 48GB: Q3_K_M (31 GB) + vision at 4K context.
  • A100 80GB: comfortable for Q4 + vision + 8K; AWQ-INT4 enables 16K+ context.
  • Dual RTX 4090: row-split the text weights, with vision VRAM split across cards.
  • Mac Studio M4 Ultra 128GB: Q4_K_M + vision at 2-5 tok/s (Molmo support on Apple Silicon is uncertain).
  • Cloud: A100 at $5-10/hr.
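The arithmetic above generalizes to other quantizations and context lengths. A back-of-envelope sketch, using the estimates quoted above (not measured values) as constants:

```python
def molmo72b_vram_gb(weights_gb: float, ctx_tokens: int,
                     vision_gb: float = 4.0) -> float:
    """Rough VRAM estimate: weights + vision encoder + KV cache.
    KV cache is scaled linearly from the ~10 GB @ 8K figure above."""
    kv_gb = 10.0 * (ctx_tokens / 8192)
    return weights_gb + vision_gb + kv_gb

# Q4_K_M (~41 GB) at 8K context -> ~55 GB, matching the total above
print(f"{molmo72b_vram_gb(41.0, 8192):.0f} GB")
# Q3_K_M (~31 GB) at 4K context -> ~40 GB, inside a 48 GB A6000
print(f"{molmo72b_vram_gb(31.0, 4096):.0f} GB")
```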

What breaks first

  1. Molmo GGUF availability. Pre-converted Molmo GGUFs are rare; you may need to convert from the Hugging Face weights using Ai2's conversion script. Verify GGUF or AWQ availability before provisioning hardware.
  2. Pointing/grounding in local inference. Molmo's coordinate outputs rely on specific output-formatting tokens, and llama.cpp may not parse these correctly. Verify that coordinate outputs are well-formed before trusting results (see the parser sketch after this list).
  3. Vision encoder compatibility. Molmo uses a custom vision encoder (not CLIP, not InternViT), so llama.cpp's standard llava implementation may not support it without model-specific patches.
  4. Apache 2.0, but verify. While Molmo is Apache 2.0 licensed, the vision encoder or training data may carry additional restrictions. Check the full license on huggingface.co/allenai/Molmo-72B.
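Failure mode 2 is worth guarding against in code. Molmo typically emits points as XML-like tags, e.g. <point x="52.1" y="34.7" alt="Submit button">Submit button</point>, with coordinates as percentages of image width and height; both the tag shape and the percent convention here are taken from Ai2's published examples and should be verified against your own outputs. A minimal parser sketch:

```python
import re
from dataclasses import dataclass

@dataclass
class Point:
    x_pct: float  # percent of image width (0-100, assumed convention)
    y_pct: float  # percent of image height (0-100, assumed convention)
    label: str

# Matches single-point tags like:
#   <point x="52.1" y="34.7" alt="Submit button">Submit button</point>
POINT_RE = re.compile(
    r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"(?:\s+alt="([^"]*)")?\s*>(.*?)</point>',
    re.DOTALL,
)

def parse_points(text: str) -> list[Point]:
    """Extract well-formed point tags. An empty result on a pointing prompt
    is a signal that the runtime mangled Molmo's formatting tokens."""
    return [
        Point(float(x), float(y), alt or body.strip())
        for x, y, alt, body in POINT_RE.findall(text)
    ]

if __name__ == "__main__":
    sample = '<point x="52.1" y="34.7" alt="Submit button">Submit button</point>'
    assert parse_points(sample)[0].x_pct == 52.1
```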

Runtime recommendation

llama.cpp with custom Molmo support (verify in your build). Ai2 may provide official inference code — prefer that over community tooling if available. vLLM for serving if Molmo is registered. Avoid Ollama unless Molmo is explicitly supported. Test with a small image before scaling.
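Ai2's Hugging Face model card includes transformers-based reference code; the sketch below follows that pattern from memory, so verify names like processor.process and generate_from_batch against the current card rather than treating them as a stable API.

```python
# Sketch of Ai2's transformers reference pattern (verify against the model card).
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image

MODEL = "allenai/Molmo-72B-0924"
processor = AutoProcessor.from_pretrained(
    MODEL, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Start with a small test image, as recommended above.
inputs = processor.process(images=[Image.open("test.png")], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```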

Common beginner mistakes

  • Mistake: Expecting Molmo to work with standard Ollama vision commands. Fix: Molmo requires custom model registration in llama.cpp. Test with raw llama.cpp and verify the multimodal GGUF.
  • Mistake: Ignoring the pointing/grounding output format. Fix: Molmo outputs coordinates in a specific format. Parse these explicitly rather than treating them as regular text, and convert them to pixels before use (see the sketch after this list).
  • Mistake: Using a Llama 3.2 Vision mmproj with Molmo. Fix: Vision projectors are architecture-specific. Download or convert the Molmo-specific projector.
  • Mistake: Assuming Molmo's text quality matches a general-purpose 72B model like Qwen 2.5 72B. Fix: Molmo is optimized for vision grounding, so general text quality may trail same-sized general-purpose models. Test text-only tasks before deploying.
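On the pointing-format mistake: after parsing (see the parser sketch under "What breaks first"), coordinates still need converting to pixels before a UI-automation click. Assuming the percent-of-image convention holds:

```python
def to_pixels(x_pct: float, y_pct: float, width: int, height: int) -> tuple[int, int]:
    """Convert percent coordinates (0-100, assumed convention) to pixels.
    Sanity-check the convention against a known image before automating clicks."""
    return round(x_pct / 100 * width), round(y_pct / 100 * height)

# A point at (52.1, 34.7) on a 1920x1080 screenshot -> (1000, 375)
print(to_pixels(52.1, 34.7, 1920, 1080))
```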

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
Molmo 7B-D · 8B · Consumer

Family siblings (molmo)
Molmo 7B-D · 8B · Consumer
Molmo 72B · 72B · You are here

Strengths

  • Apache 2.0
  • Frontier UI grounding

Weaknesses

  • 48GB+ VRAM tier

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M       | 41.0 GB   | 48 GB

Get the model

HuggingFace

Original weights

huggingface.co/allenai/Molmo-72B-0924

Source repository; no pre-quantized files, so you'll need to quantize the weights yourself.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Molmo 72B.

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Molmo 72B?

48GB of VRAM is enough to run Molmo 72B at the Q4_K_M quantization (file size 41.0 GB). Higher-quality quantizations need more.

Can I use Molmo 72B commercially?

Yes — Molmo 72B ships under the Apache 2.0 license, which permits commercial use. Always read the full license text before deployment.

What's the context length of Molmo 72B?

Molmo 72B supports a context window of 4,096 tokens (4K).

Does Molmo 72B support images?

Yes — Molmo 72B is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.

Source: huggingface.co/allenai/Molmo-72B-0924

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Before you buy

Verify Molmo 72B runs on your specific hardware before committing money.