by Google DeepMind
Google's open-weight model family derived from Gemini research. Gemma 2 + Gemma 3 cover sub-30B chat; CodeGemma adds code-specialized variants. Tight integration with Vertex AI / Google Cloud / Android Studio.
Start with Gemma 3 12B at Q4_K_M via Ollama — it fits on a single RTX 3060 12GB at Q4 (~7 GB VRAM). The 12B delivers MMLU ~82% and punches above its weight class on instruction-following (IFEval ~78%). Google's distillation from Gemini training data gives Gemma 3 context-handling quality that smaller models rarely achieve — usable 32K context without perplexity collapse. For minimum VRAM (<8 GB), use Gemma 3 4B Q4_K_M (~3 GB) — it runs on any laptop with an integrated GPU at 15+ tok/s via llama.cpp. Skip Gemma 2 27B for local deployment — Gemma 3 12B matches or exceeds it on benchmarks at half the VRAM, and Gemma 2 tops out at an 8K context window. Skip Gemma 1 entirely; it is superseded on every axis.
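A minimal sketch of that Ollama path, assuming the daemon is running locally on its default port (11434) and that `ollama pull gemma3:12b` has already fetched the weights (the default tag is a Q4_K_M quant); only the Python standard library is used:

```python
# Minimal sketch: query a locally running Ollama instance over its REST API.
# Assumes Ollama is installed, `ollama pull gemma3:12b` has completed, and the
# daemon is listening on the default port 11434.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def ask_gemma(prompt: str, model: str = "gemma3:12b") -> str:
    """Send a single non-streaming generation request and return the text."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,                # one JSON object back instead of a token stream
        "options": {"num_ctx": 8192},   # context to allocate; raise if VRAM headroom allows
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(ask_gemma("Summarize the trade-offs of Q4_K_M quantization in two sentences."))
```

The same request shape works for the 4B tag (`gemma3:4b`) on sub-8 GB machines; only the model string changes.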
For single-user local use: Ollama with gemma3:12b Q4_K_M on an RTX 3060 12GB, or an Apple M3 via MLX-LM. Gemma's GeGLU activation and 256K vocab require GGUFs built with a recent llama.cpp (b3400+) for correct RoPE theta. For multi-user serving: vLLM 0.6.1+ with AWQ 4-bit on an L4 24 GB — Gemma's dense architecture parallelizes efficiently. For mobile/edge: MediaPipe LLM Inference on Tensor G4 via Google AI Edge — Gemma 3 4B runs entirely on-device at ~12 tok/s with 4-bit quantization and GPU acceleration. For maximum throughput on NVIDIA GPUs: TensorRT-LLM with FP8 on an L40S. Note: Gemma's bespoke license prohibits using the model to generate training data for competing models — review the terms before production deployment. See the GPU buyer guide.
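For the multi-user vLLM path, here is a minimal offline-batching sketch, assuming vLLM 0.6.1+ and a 4-bit AWQ export of Gemma 3 12B; the model name below is a placeholder, not an official artifact, so substitute whatever AWQ checkpoint you actually have locally or on the Hub:

```python
# Minimal sketch: batch inference with vLLM on a 4-bit AWQ Gemma 3 checkpoint.
# The repo name is illustrative only; point `model` at your own AWQ export.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/gemma-3-12b-it-awq",  # placeholder AWQ repo or local path
    quantization="awq",                   # tell vLLM the weights are AWQ 4-bit
    max_model_len=8192,                   # caps KV-cache allocation; raise if memory allows
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# vLLM schedules these prompts with continuous batching, which is where the
# throughput win over single-user llama.cpp serving comes from.
prompts = [
    "Explain KV-cache paging in one paragraph.",
    "Write a haiku about quantization.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```

The same model and quantization arguments can be passed to vLLM's OpenAI-compatible server for networked multi-user access instead of the in-process `LLM` object.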
Models in this family with our verdicts
Verify Gemma runs on your specific hardware before committing money.