llama.cpp
Also known as: llamacpp, llama cpp
llama.cpp is a C/C++ inference engine for running large language models (LLMs) locally on consumer hardware. It loads quantized model weights in the GGUF format and executes them on CPU or GPU, with support for Apple Metal, NVIDIA CUDA, AMD ROCm, and Vulkan. The project prioritizes minimal dependencies, a low memory footprint, and efficient CPU inference via integer quantization (e.g., 4-bit, 5-bit) and optimized kernels. Operators use it directly from the command line or through wrappers such as Ollama and LM Studio.
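For illustration, a minimal build on a machine with an NVIDIA GPU might look like the following sketch (CMake option names have changed between releases; -DGGML_CUDA=ON assumes a recent checkout, and Metal or Vulkan builds use different flags):

    # fetch and build llama.cpp with CUDA support (recent releases)
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j 8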
Deeper dive
llama.cpp was created by Georgi Gerganov to run LLaMA models on a MacBook without a dedicated GPU. It introduced the GGUF format, which packages quantized weights, the tokenizer, and model metadata into a single file. The engine supports quantization levels from Q2_K through Q8_0 that trade precision for a smaller memory footprint. It also implements batched inference for higher throughput and a server mode with an OpenAI-compatible API. Key optimizations include K-quantization, a block-wise scheme that mixes precision across tensors (the _S/_M/_L suffixes indicate the mix), and memory-mapped loading for fast startup. The project has spawned many forks and integrations, making it the de facto standard for local LLM deployment on CPU and hybrid CPU/GPU setups.
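As a sketch of the conversion and quantization workflow described above (script and binary names have shifted over time; older releases used convert.py and ./quantize, and the model directory here is a hypothetical local Hugging Face checkout):

    # convert a Hugging Face checkpoint to a 16-bit GGUF file
    python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct --outfile model-f16.gguf --outtype f16
    # re-quantize to the 4-bit K-quant mix used in the examples below
    ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M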
Practical example
An operator with an RTX 3060 (12 GB VRAM) can run Llama 3.1 8B at Q4_K_M (~5 GB) entirely on the GPU, achieving roughly 30 tokens/sec. A comparable Mistral 7B at the higher-precision Q5_K_M (~5 GB) also fits on the same card. Trying to run Llama 3.1 70B at Q4_K_M (~40 GB), however, means most layers must stay in system RAM and run on the CPU, dropping speed to roughly 2 tokens/sec. The operator controls this with a command like ./main -m model.gguf -n 256 -ngl 35, which offloads 35 layers to the GPU (recent releases name this binary llama-cli), as sketched below.
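A sketch of the two scenarios (file names and the 70B layer count are illustrative; passing -ngl 99 requests more layers than the 8B model has, so all of its layers are offloaded):

    # 8B model at Q4_K_M: all layers fit in 12 GB of VRAM
    ./main -m llama-3.1-8b-q4_k_m.gguf -p "Hello" -n 256 -ngl 99
    # 70B model at Q4_K_M: only some layers are offloaded, the rest run on the CPU from system RAM
    ./main -m llama-3.1-70b-q4_k_m.gguf -p "Hello" -n 256 -ngl 20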
Workflow example
In Ollama, ollama pull llama3.1:8b downloads a GGUF file; when the model runs, Ollama uses llama.cpp as the runtime. Under the hood it passes parameters equivalent to -ngl 99 to offload all layers to the GPU, and if VRAM is insufficient it automatically reduces the offload count. Operators can also run llama.cpp directly: ./main -m model.gguf -p "Hello" -n 128 -t 8 uses 8 CPU threads. The server mode (./server -m model.gguf) exposes an HTTP endpoint compatible with OpenAI client libraries, as sketched below.
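A sketch of server mode, assuming the default port of 8080 (recent releases name the binary llama-server):

    # start the HTTP server with an OpenAI-compatible API
    ./server -m model.gguf --port 8080
    # query it from another shell; any OpenAI client library pointed at this base URL works too
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'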