Frameworks & tools

llama.cpp

Also known as: llamacpp, llama cpp

llama.cpp is a C++ inference engine for running large language models (LLMs) locally on consumer hardware. It loads quantized model weights (e.g., GGUF format) and executes them on CPU or GPU, with support for Apple Metal, NVIDIA CUDA, AMD ROCm, and Vulkan. The project prioritizes minimal dependencies, low memory footprint, and efficient CPU inference via integer quantization (e.g., 4-bit, 5-bit) and optimized kernels. Operators use it directly via command line or through wrappers like Ollama and LM Studio.
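As a rough sketch of direct command-line use (the repository URL is current at time of writing; older Makefile builds produce a binary named ./main, newer CMake builds name it llama-cli, and the model path below is a placeholder):

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make                                                   # CPU-only build; GPU backends need extra build flags
    ./main -m ./models/your-model.gguf -p "Hello" -n 64    # placeholder GGUF path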

Deeper dive

llama.cpp was created by Georgi Gerganov to run LLaMA models on a MacBook without a GPU. It introduced the GGUF format, which packages quantized weights, tokenizer, and model metadata into a single file. The engine supports various quantization levels (Q2_K through Q8_0) that trade precision for VRAM usage. It also implements a batched inference mode for higher throughput and a server mode with an OpenAI-compatible API. Key optimizations include K-quantization, which adapts quantization precision per layer, and memory-mapped loading for fast startup. The project has spawned many forks and integrations, making it the de facto standard for local LLM deployment on CPU and hybrid GPU setups.
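A hedged sketch of the server mode mentioned above (the binary is ./server in older builds and llama-server in newer ones; the exact JSON fields accepted vary by version, but /v1/chat/completions follows the OpenAI schema):

    ./server -m model.gguf -ngl 35 --port 8080
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"Explain GGUF in one sentence."}],"max_tokens":64}'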

Practical example

An operator with an RTX 3060 (12 GB VRAM) can run Llama 3.1 8B at Q4_K_M (5 GB) entirely on GPU, achieving ~30 tokens/sec. Mistral 7B at Q5_K_M (5.5 GB) also fits on the same card; the file is slightly larger despite fewer parameters because Q5_K_M keeps more bits per weight. But trying to run Llama 3.1 70B at Q4_K_M (~40 GB) would require offloading layers to system RAM, dropping speed to ~2 tokens/sec. The operator would use a command like ./main -m model.gguf -n 256 -ngl 35 to offload 35 layers to GPU.
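In command form, those two situations look roughly like this (model filenames are placeholders, and the right -ngl value depends on layer sizes and whatever else occupies VRAM):

    ./main -m llama-3.1-8b-Q4_K_M.gguf -n 256 -ngl 99     # ~5 GB file: every layer fits on the 12 GB card
    ./main -m llama-3.1-70b-Q4_K_M.gguf -n 256 -ngl 20    # ~40 GB file: only some layers fit; the rest run from system RAM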

Workflow example

In Ollama, when you run ollama pull llama3.1:8b, it downloads a GGUF file and uses llama.cpp as the runtime. Under the hood, Ollama calls llama.cpp with parameters like -ngl 99 to offload all layers to GPU. If VRAM is insufficient, Ollama automatically reduces the offload count. Operators can also run llama.cpp directly: ./main -m model.gguf -p "Hello" -n 128 -t 8 uses 8 CPU threads. The server mode (./server -m model.gguf) provides an HTTP endpoint compatible with OpenAI client libraries.
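A hedged end-to-end sketch of both paths (the Ollama tag exists as of writing; Ollama picks the offload count itself from detected VRAM, and the direct invocation uses the classic binary name):

    ollama pull llama3.1:8b                          # downloads a GGUF and registers it with Ollama's llama.cpp runner
    ollama run llama3.1:8b "Summarize GGUF in one line."
    ./main -m model.gguf -p "Hello" -n 128 -t 8      # same engine driven directly; -t pins 8 CPU threads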

Related terms

  • GGUF
  • Quantization
  • GGML
  • KoboldCpp
  • Ollama

Reviewed by Fredoline Eruo. See our editorial policy.
