72B parameters · Commercial OK · Multimodal · Reviewed May 2026

Molmo 72B

Ai2's Molmo flagship: an Apache 2.0 VLM that rivals proprietary models on UI pointing and visual reasoning.

License: Apache 2.0 · Released Sep 25, 2024 · Context: 4,096 tokens

Overview

Molmo 72B is the flagship of Ai2's Molmo family: an Apache 2.0 vision-language model that rivals proprietary models on UI pointing and visual reasoning. Its standout capability is pixel-precise pointing, which lets it reference specific regions of an image by coordinates.

How to run it

Molmo 72B is Ai2's vision-language model: a 72B dense backbone paired with a custom vision encoder, designed for strong visual understanding with a focus on pointing and grounding (it can reference specific image regions by coordinates). That pixel-precise pointing is Molmo's unique feature, and it makes the model useful for UI automation, grounded visual QA, and robotics.

Run it at Q4_K_M via llama.cpp with its llava-style server for vision. Expect a ~41 GB file for the text weights plus ~3-5 GB for the vision components. Minimum VRAM is 48 GB (an RTX A6000 at Q3_K_M with vision); recommended is an A100 80GB at AWQ-INT4. Throughput is roughly 12-20 tok/s on an A6000 at Q4_K_M text-only, with image encoding adding 1-3 s per request.

Ai2's license is permissive (Apache 2.0), but ecosystem support is narrower than for Llama or Qwen vision models: verify llama.cpp's Molmo support before committing, expect Ollama to lack a Molmo build (use raw llama.cpp instead), and for serving, use vLLM only if Molmo is registered as a supported architecture.
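If your vLLM build does register Molmo as a supported architecture, offline inference looks roughly like the sketch below. This is a hedged sketch, not confirmed Molmo-specific usage: the prompt string and the multi_modal_data key follow vLLM's generic vision-model API, and the exact Molmo chat template should be checked against vLLM's model docs.

```python
# Sketch: assumes a vLLM version that lists Molmo among supported architectures.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="allenai/Molmo-72B-0924",
    trust_remote_code=True,   # Molmo ships custom modeling code
    max_model_len=4096,       # matches the model's 4,096-token context
)

out = llm.generate(
    {
        # The exact prompt/chat template is an assumption; verify for Molmo.
        "prompt": "Point to the Submit button.",
        "multi_modal_data": {"image": Image.open("screenshot.png")},
    },
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(out[0].outputs[0].text)
```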

Hardware guidance

Minimum: RTX A6000 48GB at Q3_K_M + vision (4K context). Recommended: A100 80GB at AWQ-INT4.

VRAM math: 72B dense at Q4_K_M ≈ 41 GB, Molmo vision encoder ≈ 3-5 GB, KV cache at 8K ≈ 10 GB, for a total of ~54-56 GB.

  • A6000 48GB: Q3_K_M (31 GB) + vision at 4K context.
  • A100 80GB: comfortable for Q4 + vision + 8K; AWQ-INT4 enables 16K+ context.
  • Dual RTX 4090: row-split the text weights, with vision VRAM split across cards.
  • Mac Studio M4 Ultra 128GB: Q4_K_M + vision at 2-5 tok/s (Molmo support on Apple Silicon is uncertain).
  • Cloud: A100 at $5-10/hr.
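The arithmetic above generalizes to other quantizations and context lengths. A back-of-envelope sketch, using the estimates quoted above (not measured values) as constants:

```python
def molmo72b_vram_gb(weights_gb: float, ctx_tokens: int,
                     vision_gb: float = 4.0) -> float:
    """Rough VRAM estimate: weights + vision encoder + KV cache.
    KV cache is scaled linearly from the ~10 GB @ 8K figure above."""
    kv_gb = 10.0 * (ctx_tokens / 8192)
    return weights_gb + vision_gb + kv_gb

# Q4_K_M (~41 GB) at 8K context -> ~55 GB, matching the total above
print(f"{molmo72b_vram_gb(41.0, 8192):.0f} GB")
# Q3_K_M (~31 GB) at 4K context -> ~40 GB, inside a 48 GB A6000
print(f"{molmo72b_vram_gb(31.0, 4096):.0f} GB")
```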

What breaks first

  1. Molmo GGUF availability. Pre-converted Molmo GGUFs are rare; you may need to convert from the Hugging Face weights using Ai2's conversion script. Verify GGUF or AWQ availability before provisioning hardware.
  2. Pointing/grounding in local inference. Molmo's coordinate outputs rely on specific output-formatting tokens, and llama.cpp may not parse these correctly. Verify that coordinate outputs are well-formed before trusting results (see the parser sketch after this list).
  3. Vision encoder compatibility. Molmo uses a custom vision encoder (not CLIP, not InternViT), so llama.cpp's standard llava implementation may not support it without model-specific patches.
  4. Apache 2.0, but verify. While Molmo is Apache 2.0 licensed, the vision encoder or training data may carry additional restrictions. Check the full license on huggingface.co/allenai/Molmo-72B.
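Failure mode 2 is worth guarding against in code. Molmo typically emits points as XML-like tags, e.g. <point x="52.1" y="34.7" alt="Submit button">Submit button</point>, with coordinates as percentages of image width and height; both the tag shape and the percent convention here are taken from Ai2's published examples and should be verified against your own outputs. A minimal parser sketch:

```python
import re
from dataclasses import dataclass

@dataclass
class Point:
    x_pct: float  # percent of image width (0-100, assumed convention)
    y_pct: float  # percent of image height (0-100, assumed convention)
    label: str

# Matches single-point tags like:
#   <point x="52.1" y="34.7" alt="Submit button">Submit button</point>
POINT_RE = re.compile(
    r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"(?:\s+alt="([^"]*)")?\s*>(.*?)</point>',
    re.DOTALL,
)

def parse_points(text: str) -> list[Point]:
    """Extract well-formed point tags. An empty result on a pointing prompt
    is a signal that the runtime mangled Molmo's formatting tokens."""
    return [
        Point(float(x), float(y), alt or body.strip())
        for x, y, alt, body in POINT_RE.findall(text)
    ]

if __name__ == "__main__":
    sample = '<point x="52.1" y="34.7" alt="Submit button">Submit button</point>'
    assert parse_points(sample)[0].x_pct == 52.1
```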

Runtime recommendation

llama.cpp with custom Molmo support (verify in your build). Ai2 may provide official inference code — prefer that over community tooling if available. vLLM for serving if Molmo is registered. Avoid Ollama unless Molmo is explicitly supported. Test with a small image before scaling.
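Ai2's Hugging Face model card includes transformers-based reference code; the sketch below follows that pattern from memory, so verify names like processor.process and generate_from_batch against the current card rather than treating them as a stable API.

```python
# Sketch of Ai2's transformers reference pattern (verify against the model card).
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image

MODEL = "allenai/Molmo-72B-0924"
processor = AutoProcessor.from_pretrained(
    MODEL, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Start with a small test image, as recommended above.
inputs = processor.process(images=[Image.open("test.png")], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```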

Common beginner mistakes

  • Mistake: Expecting Molmo to work with standard Ollama vision commands. Fix: Molmo requires custom model registration in llama.cpp. Test with raw llama.cpp and verify the multimodal GGUF.
  • Mistake: Ignoring the pointing/grounding output format. Fix: Molmo outputs coordinates in a specific format. Parse these explicitly rather than treating them as regular text, and convert them to pixels before use (see the sketch after this list).
  • Mistake: Using a Llama 3.2 Vision mmproj with Molmo. Fix: Vision projectors are architecture-specific. Download or convert the Molmo-specific projector.
  • Mistake: Assuming Molmo's text quality matches a general-purpose 72B model like Qwen 2.5 72B. Fix: Molmo is optimized for vision grounding, so general text quality may trail same-sized general-purpose models. Test text-only tasks before deploying.
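On the pointing-format mistake: after parsing (see the parser sketch under "What breaks first"), coordinates still need converting to pixels before a UI-automation click. Assuming the percent-of-image convention holds:

```python
def to_pixels(x_pct: float, y_pct: float, width: int, height: int) -> tuple[int, int]:
    """Convert percent coordinates (0-100, assumed convention) to pixels.
    Sanity-check the convention against a known image before automating clicks."""
    return round(x_pct / 100 * width), round(y_pct / 100 * height)

# A point at (52.1, 34.7) on a 1920x1080 screenshot -> (1000, 375)
print(to_pixels(52.1, 34.7, 1920, 1080))
```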

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
Molmo 7B-D · 8B · Consumer

Family siblings (molmo)
Molmo 7B-D · 8B · Consumer
Molmo 72B · 72B · You are here

Strengths

  • Apache 2.0
  • Frontier UI grounding

Weaknesses

  • 48GB+ VRAM tier

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M       | 41.0 GB   | 48 GB

Get the model

HuggingFace

Original weights

huggingface.co/allenai/Molmo-72B-0924

Source repository; no pre-quantized files, so you'll need to quantize the weights yourself.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Molmo 72B.

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Molmo 72B?

48GB of VRAM is enough to run Molmo 72B at the Q4_K_M quantization (file size 41.0 GB). Higher-quality quantizations need more.

Can I use Molmo 72B commercially?

Yes — Molmo 72B ships under the Apache 2.0 license, which permits commercial use. Always read the full license text before deployment.

What's the context length of Molmo 72B?

Molmo 72B supports a context window of 4,096 tokens (4K).

Does Molmo 72B support images?

Yes — Molmo 72B is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.

Source: huggingface.co/allenai/Molmo-72B-0924

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Before you buy

Verify Molmo 72B runs on your specific hardware before committing money.