How to benchmark local AI properly
The pillar guide on running a local AI benchmark someone else can actually trust. What to measure (decode tok/s, TTFT, VRAM peak, sustained tok/s), how to measure it on vLLM / llama.cpp / Ollama / MLX, what to pin (versions, drivers, OS), what not to claim, and how to submit it to RunLocalAI.
Why a benchmark needs a contract
Most local-AI benchmarks shared on Reddit, X, and YouTube are not actually benchmarks. They're a number with a model name and a GPU name attached, and the implicit claim is “this is what running this model on this GPU feels like.” The number is usually defensible at the moment it was measured. The problem is everything else: the prompt length, the batch, the runtime version, the driver, the quantization, the context, the warmup, the thermal state. Change any one of those and the number changes 20-50%.
A benchmark that someone else can trust is a benchmark someone else can reproduce. That requires a contract: a fixed measurement target, a fixed cohort definition, a recorded environment, and a published artifact. The rest of this guide is what each of those means in practice. Once you publish a benchmark this way, it can graduate up the trust ladder — see the confidence methodology for how RunLocalAI rates incoming submissions, and /trust/benchmarks for the full data-source policy.
What to measure
There is no single “tok/s” number. A benchmark is incomplete unless it splits the load into the dimensions that actually move independently:
- Decode tok/s — the steady-state generation speed once the model has finished prefill. This is what an interactive chat session feels like after the first token. Always report this separately from prefill.
- Time to first token (TTFT) — wall-clock seconds from the moment you submit a prompt until the first token appears. TTFT scales with prompt length and batch; it is the latency operators care about for agent loops, voice interfaces, and streaming UIs. Decode tok/s alone hides this.
- Sustained tok/s — measured over a long-form generation (1,000+ output tokens). This catches thermal throttling and cache-pressure effects that a 30-second burst hides.
- Burst tok/s (optional, label it) — the peak you can hit on the first 100-200 tokens with a cold runtime. Useful for marketing, dishonest if reported as the experience.
- VRAM peak — the highest nvidia-smi / rocm-smi / Metal allocation reading observed during a full generation, including KV cache. This number is what tells someone else whether the configuration fits on their card.
- Power draw, if instrumented — board power in watts averaged over the sustained run. Per-watt efficiency is a real comparison axis once you cross 200W TDP.
- Eval scores via lm-evaluation-harness, if applicable — for any quantization claim (“Q4_K_M doesn't hurt quality much”), the only honest evidence is a score against a fixed eval harness. Throughput numbers say nothing about correctness.
Each of these is a separate column in any honest benchmark table. Collapsing them all into one “tok/s” figure is the single most common error.
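As a concrete illustration, one row per repetition with the metrics broken out might look like the layout below. This is a sketch of a reporting format, not a required schema; the column names are ours.
# Illustrative results layout: one row per repetition, aggregated to median + spread before publishing
echo "model,quant,runtime,runtime_version,gpu,driver,batch,prompt_tokens,ttft_ms,decode_tps,sustained_tps,vram_peak_mb,power_w" > results.csv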
How to measure on each runtime
The exact command lines below are what we use internally to populate the catalog. Adapt the model paths and ports to your setup; the flags are the load-bearing part.
vLLM
vLLM has a built-in benchmarking harness in vllm/benchmarks/. The benchmark_serving.py script measures TTFT, decode tok/s, and request throughput against a running vLLM server with a real workload distribution.
# Server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-32B-Instruct-AWQ \
--quantization awq \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
# Benchmark client (in vllm repo)
python benchmarks/benchmark_serving.py \
--model Qwen/Qwen2.5-32B-Instruct-AWQ \
--dataset-name sharegpt \
--num-prompts 200 \
--request-rate 4

Report: median TTFT, P99 TTFT, decode tok/s per request, total throughput, peak VRAM via nvidia-smi --query-gpu=memory.used --format=csv -lms 500 in a side terminal. See /tools/vllm for the runtime profile.
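To turn that side-terminal polling into a single peak number, one option is to log the samples during the run and take the maximum afterwards. A minimal sketch, assuming an NVIDIA card; the 500 ms interval and the vram.log filename are arbitrary choices:
# Start polling GPU memory in the background while the benchmark client runs
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -lms 500 > vram.log &
SMI_PID=$!

# (run the benchmark_serving.py client from the block above here)

# Stop polling and report the peak reading in MiB
kill "$SMI_PID"
sort -n vram.log | tail -n 1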
llama.cpp
llama.cpp ships llama-bench, the reference benchmarking binary. It measures pp (prompt processing, i.e. prefill) and tg (token generation, i.e. decode) separately, exactly the split you want.
./llama-bench \
-m models/qwen2.5-32b-instruct-q4_k_m.gguf \
-p 512 \
-n 256 \
-ngl 99 \
-t 8 \
-r 5

-p 512 sets prefill length, -n 256 sets decode length, -ngl 99 offloads all layers to GPU, -r 5 runs five repetitions. The output table reports pp tok/s and tg tok/s with standard deviations across runs. See /tools/llama-cpp.
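The 256-token decode above is a burst, not a sustained run. Recent llama-bench builds accept comma-separated value lists for -p and -n, so one invocation can cover both a short and a long decode; this is a sketch, so confirm the list syntax against ./llama-bench --help on your build:
# Sweep short and long prefill/decode lengths in one run, five repetitions each
./llama-bench \
-m models/qwen2.5-32b-instruct-q4_k_m.gguf \
-p 512,2048 \
-n 256,1024 \
-ngl 99 \
-t 8 \
-r 5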
Ollama
Ollama doesn't ship a benchmark harness, but it exposes the underlying llama.cpp metrics on every API response under the eval_count, eval_duration, prompt_eval_count, and prompt_eval_duration fields. Decode tok/s is eval_count / (eval_duration / 1e9).
curl -s http://localhost:11434/api/generate -d '{
"model": "qwen2.5:32b",
"prompt": "Write a 500-word essay on memory bandwidth.",
"stream": false
}' | jq '{
prefill_tokens: .prompt_eval_count,
decode_tokens: .eval_count,
decode_tps: (.eval_count / (.eval_duration / 1e9))
}'

Run this 5+ times after a warmup request and report the median. See /tools/ollama.
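A small loop covers the warmup, the repetitions, and the median in one go. A sketch assuming jq is installed and the daemon is already serving the model; the sed line simply picks the middle of five sorted values:
# Six requests: the first is the warmup and gets dropped, the next five are measured
for i in $(seq 0 5); do
  curl -s http://localhost:11434/api/generate -d '{
    "model": "qwen2.5:32b",
    "prompt": "Write a 500-word essay on memory bandwidth.",
    "stream": false
  }' | jq '.eval_count / (.eval_duration / 1e9)'
done | tail -n 5 | sort -n | sed -n '3p'   # median decode tok/s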
MLX (Apple Silicon)
mlx-lm exposes a generation benchmark via mlx_lm.generate with --verbose, which prints prefill tok/s, decode tok/s, and peak memory at the end of each run.
mlx_lm.generate \
--model mlx-community/Qwen2.5-32B-Instruct-4bit \
--prompt "Explain memory bandwidth in 500 words." \
--max-tokens 500 \
--verbose

Repeat 5+ times, report median decode tok/s, and observe Activity Monitor or powermetrics --samplers gpu_power for peak unified-memory use and power. See /tools/mlx-lm.
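A repetition loop for the MLX runs, with the verbose statistics captured to a log for later aggregation. A minimal sketch; the first iteration serves as the warmup, and mlx_runs.log is just a name we picked:
# Six runs: discard the first as warmup, take the median of the remaining five
for i in $(seq 1 6); do
  mlx_lm.generate \
    --model mlx-community/Qwen2.5-32B-Instruct-4bit \
    --prompt "Explain memory bandwidth in 500 words." \
    --max-tokens 500 \
    --verbose
done 2>&1 | tee mlx_runs.log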
Versions are part of the measurement
A benchmark without recorded versions is a number without a year. Inference engines move fast enough that a 15-25% throughput swing between minor versions is normal. A driver update can move the same number 5-15%. The minimum environmental record we expect on every submission to RunLocalAI is below; a capture-script sketch follows the list.
- Runtime version — exact tag or commit. vllm --version, ./llama-cli --version (shows the build commit), ollama --version, mlx-lm --version.
- Driver version — nvidia-smi top header on NVIDIA, rocm-smi --showdriverversion on AMD, system_profiler SPSoftwareDataType on macOS.
- CUDA / ROCm / Metal version — nvcc --version, rocm-smi --showversion, the macOS build number.
- OS and kernel — uname -a on Linux, the macOS build, the Windows build number.
- Quantization — exact format string (Q4_K_M, AWQ-INT4, GPTQ-int4, MLX-4bit, EXL2-4.5bpw). “4-bit” is not a quantization.
- Context length — both the configured max and the actual prompt length used during the measurement.
- Batch size — explicit. Single-stream tok/s is a different cohort than batch=N tok/s.
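A rough capture script for the items above, written for an NVIDIA Linux box running vLLM; swap in the rocm-smi or macOS commands from the list as appropriate, and note that env.txt is just a name we picked:
# Record the environment next to the benchmark logs (NVIDIA / Linux / vLLM variant)
{
  echo "== runtime =="; vllm --version
  echo "== driver / CUDA =="; nvidia-smi; nvcc --version
  echo "== OS / kernel =="; uname -a
} > env.txt 2>&1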
See /resources/versioned-benchmarking for why we treat the runtime version as part of the cohort key, not a footnote.
Reproducibility — the artifact you publish
The output of a benchmark run is not a number. It's an artifact someone else can re-run. The full reproduction protocol is at /resources/reproduction-guide; the abbreviated checklist:
- The exact command you ran, copy-pasteable.
- The runtime stdout / log output, captured to a file.
- The driver state (nvidia-smi -q, rocm-smi -a, system_profiler SPDisplaysDataType).
- The model file checksum if you're measuring a quantization variant. Two GGUFs claiming to be “Qwen2.5-32B Q4_K_M” are not necessarily the same file.
- The number of repetitions and the median + standard deviation, not just one number (a quick sketch of the checksum and median steps follows this list).
- Thermal context — was this 30 seconds after boot, or 20 minutes into a sustained workload? Disclose it.
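Two commands cover the checksum and the median + spread items. A minimal sketch assuming a Linux box with Python 3, and assuming runs.txt holds one decode tok/s value per line; on macOS, shasum -a 256 replaces sha256sum:
# Checksum the exact model file that was benchmarked
sha256sum models/qwen2.5-32b-instruct-q4_k_m.gguf

# Median and standard deviation over the repeated runs
python3 -c 'import statistics, sys; xs = [float(l) for l in sys.stdin]; print(f"median={statistics.median(xs):.1f} stdev={statistics.stdev(xs):.1f}")' < runs.txt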
You can see what reproduction looks like in production at /benchmarks/reproduce, our open queue of benchmarks operators are independently reproducing.
What not to claim
Most of these moves start out unintentional. They become dishonest the moment they are published anyway.
- Burst tok/s as sustained. A 30-second run on a cold laptop GPU does not measure the experience of a 10-minute coding session. If the laptop throttles 25% after two minutes, the 30-second number is misleading.
- Cherry-picked best run. Five runs, report the fastest — this is not measurement, it's curation. Always report the median and the spread.
- A throttled GPU as a benchmark. If the GPU was thermal-limited or power-limited during the run, the number measures the throttle, not the GPU.
- Cross-runtime comparison without cohort fixing. “vLLM does 80 tok/s, llama.cpp does 60” is meaningless without the same model, same quant, same hardware, same prompt distribution. We talk through these traps explicitly in /guides/local-ai-benchmarking-mistakes.
- Throughput as a quality claim. Q2_K is faster than Q4_K_M. It is also worse. Speed alone tells you nothing about whether the model still produces the right answer.
Submitting to RunLocalAI
If you've done the work above, your benchmark is most of the way to a public, citable record. Three submission paths:
- /submit/benchmark — the general-purpose form for any model × hardware × runtime measurement. Goes through the trust ladder at /trust/benchmarks and lands in the catalog with a confidence tier from /resources/confidence-methodology.
- /benchmarks/wanted — the open queue of triples we don't have and would prioritize. If your hardware matches one, your submission jumps a confidence tier.
- /benchmarks/request — if you want a specific cohort measured by an operator with that hardware, post a request. The market matches it to a verified operator.
Reading /guides/local-ai-benchmarking-mistakes next is recommended — it's the negative-space companion to this guide and the single best way to spot the failure modes in your own measurement before you publish.
Next recommended step
If you ran a measurement using the protocol above, submitting it through /submit/benchmark is the path into the catalog.
Benchmarking is only useful when you can compare your numbers against a known baseline from the same hardware class. A 3090 running Qwen 2.5 at 4-bit quantization should land within a predictable range, and if it does not, something is wrong with your configuration. The hardware verdicts on this site exist precisely to provide those baselines so you are not guessing whether your rig is performing as it should relative to equivalent builds.
The hardware baselines you are measuring against: best GPU for Qwen, and custom GPU comparison.