by Cohere
Cohere's RAG-tuned open-weight family with first-class document citation + multilingual coverage. Non-commercial license is the practical limit vs Llama / Qwen for production use.
Start with Command R 35B at Q4_K_M via Ollama — fits on 2× RTX 3060 12GB (24 GB total) or single RTX 4090 24 GB. Command R is purpose-built for RAG (retrieval-augmented generation) with native grounded generation — it cites sources and resists hallucination better than any comparably-sized open-weight model. The 128K native context window is real, not interpolated — Command R maintains retrieval accuracy above 90% at 128K on RULER. For lower VRAM (<16 GB), Command R 7B Q4 (~5 GB) runs on MacBook Pro M4 Max — retains the grounded generation capability at consumer scale. Skip Command R+ (104B) for local use — it requires datacenter hardware. The CC-BY-NC-4.0 license prohibits commercial use — this is the biggest practical limitation. For production RAG pipelines where commercial use is required, use Llama 3.3 70B instead.
For single-user local: Ollama + command-r:35b Q4_K_M on 2× RTX 3060 12GB or single RTX 4090. Command R uses Cohere's custom tokenizer (256K vocab) — GGUF is required; AWQ/GPTQ/EXL2 support is community-only and not production-grade. For multi-user serving: vLLM 0.5.5+ on 2× A100 40GB — the 128K context requires ~100 GB total VRAM for KV-cache at high concurrency. For RAG-optimized serving: pair Command R with llama.cpp server mode and a separate embedding server running BGE-M3 for retrieval. The grounded generation prompt format uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> delimiters — incorrect formatting disables citation behavior. CC-BY-NC-4.0 blocks production deployment for revenue-generating services — verify license compatibility before deploying beyond personal/research use.
Models in this family with our verdicts
Verify Command R runs on your specific hardware before committing money.