Local AI benchmarking methodology — operator checklist

A working checklist for benchmarking a local AI setup honestly. Built for the Reddit poster, the GitHub issue commenter, the YouTube creator with a fresh GPU on the bench, and anyone trying to cite a tok/s number that will hold up to a follow-up question. The bar is not perfection. The bar is enough metadata that the next operator can reproduce the measurement.

Editorial · Methodology
By Fredoline Eruo · Last reviewed 2026-05-08

Before you measure

The most common reason two operators report wildly different tok/s numbers for “the same” setup is that they were not running the same setup. Before you click run, pin the five things that drift hardest: runtime version, driver version, model file, quantization format, and context length. Then write down the hardware and the thermal state. None of this is exotic; the discipline is doing it before the measurement, not after.

Pin the runtime version exactly. “Latest llama.cpp” is not a version; b3447 is. vLLM 0.6.4 is. Pin the driver — for NVIDIA that is the output of nvidia-smi --query-gpu=driver_version --format=csv; for AMD it is rocm-smi --showdriverversion; for Apple Silicon it is sw_vers plus the Metal version from system_profiler SPDisplaysDataType. Pin the model file by name and by checksum. The GGUF file you downloaded last month is not necessarily the GGUF file someone else has — uploaders re-quantize and re-upload, and the filename does not always change. Run sha256sum model.gguf and record the first eight characters of the digest.
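
A minimal capture, assuming an NVIDIA card and a llama.cpp build whose binaries accept --version (on AMD or Apple Silicon, substitute the rocm-smi or system_profiler commands above):

  # Pin the stack before the run and keep this output with your results.
  ./llama-cli --version                                  # runtime build, e.g. b3447
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
  sha256sum model.gguf | cut -c1-8                       # first 8 chars of the digest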

Pin the quantization. Q4_K_M and Q4_0 are different files with different quality and different speed. FP8, AWQ, and GPTQ-Int4 are all distinct formats. Pin the context length you actually used for the run, because tok/s at 2k context and tok/s at 32k context are not comparable. And document the hardware: GPU model, RAM (system + VRAM), CPU, and the cooling state — was the GPU at idle temperature when the run started, or had it been thermally loaded for an hour?
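
One way to capture the hardware line in a single paste, assuming Linux with an NVIDIA GPU; the temperature reading doubles as the cooling state at run start:

  nvidia-smi --query-gpu=name,memory.total,temperature.gpu --format=csv,noheader
  grep -m1 "model name" /proc/cpuinfo                    # CPU model
  free -h | awk '/^Mem:/ {print $2 " system RAM"}'       # total system RAM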

What to measure

The numbers worth publishing fall into five categories. They are easy to confuse with each other, and the confusion is where most public benchmarks lose their citation value.

Decode tok/s is the steady-state generation speed once the model has produced its first few tokens. This is the number people usually mean when they say “tok/s” — it dominates the experience of long responses. Measure decode separately from prefill (the time-to-first-token phase), because the two phases run on different code paths and a single tok/s figure that averages them is misleading at long context.
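
If your runtime is llama.cpp, llama-bench already keeps the two phases apart: it reports prompt processing (pp, the prefill) and text generation (tg, the decode) as separate rows. A sketch with a 2k-token prompt and five repetitions:

  # The pp2048 row is prefill speed; the tg128 row is decode speed. Report both, never their average.
  ./llama-bench -m model.gguf -p 2048 -n 128 -r 5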

TTFT — time to first token — is the elapsed time from request submission until the first decoded token leaves the model. For chat workloads it is dominated by prompt length and KV-cache prefill speed; for short prompts it is nearly instant; for long-context retrieval it can dominate total latency. Report it separately. VRAM peak is the highest VRAM used during the run, which can be meaningfully larger than the steady-state usage shown by nvidia-smi at idle. Sustained tok/s under a 10-minute load tells you how the rig behaves once it has heat-soaked; many setups peak at one number and settle at a slower one. Power draw, if you have a wall meter or PDU readout, lets readers compute perf-per-watt for themselves.
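
A low-effort way to get the VRAM peak and power draw on an NVIDIA card is to sample in the background for the whole run, then take the maximum of the memory column afterwards; a sketch:

  nvidia-smi --query-gpu=memory.used,power.draw --format=csv,noheader -l 1 > gpu_trace.csv &
  TRACE_PID=$!
  # ... run the benchmark ...
  kill $TRACE_PID
  sort -t, -k1 -n gpu_trace.csv | tail -1                # row containing the VRAM peak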

What NOT to claim

Five claims you should never publish. None of them are about your honesty; they are about the structural ways tok/s measurement misleads.

Do not present a burst tok/s reading as if it were sustained. First-token bursts on cold caches are routinely 30 to 50 percent above steady state. Do not publish a number that came from a single run. Three to five runs with a median or trimmed mean cost almost nothing and protect you from one bad sample. Do not publish results from a thermally throttled run. If your GPU temperature crossed 83 C on NVIDIA or 95 C on AMD during the run, the result is not a measurement of your stack — it is a measurement of your cooling. Note it and re-run.
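
A minimal version of the three-to-five-run discipline, assuming llama.cpp's llama-cli with the prompt held in a hypothetical $PROMPT variable; the exact wording of the timing line varies by build, but the per-run decode figure lands in each log and you report the middle value, not the best one:

  for i in 1 2 3 4 5; do
    ./llama-cli -m model.gguf -p "$PROMPT" -n 256 >/dev/null 2>run_$i.log
    grep "tokens per second" run_$i.log                  # decode figure for run $i
  done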

Do not compare your numbers against someone else’s published number unless you have matched their version metadata. “I get faster tok/s than X” is a meaningless statement when X used vLLM 0.6.x and you used llama.cpp b3447. And do not present a benchmark of an outdated runtime version as if it represents the current state of that engine; runtimes move quickly enough that a six-month-old number can be flat wrong about today’s performance.

Reproducibility floor

A benchmark is not a number; it is a number plus enough metadata for someone else to land within ten percent of it. That bundle is the reproducibility floor. Five elements, all small.

Record the exact command line you ran, including flags. The difference between --n-gpu-layers 99 and --n-gpu-layers 35 is the difference between a GPU benchmark and a CPU+GPU benchmark. Save the full stdout and stderr from the run; the runtime usually prints the model file path, the quant detected, the context size, and sometimes the realized tok/s, all of which are sanity checks for the operator reproducing your work. Snapshot the driver and tool versions: the output of nvidia-smi, rocm-smi, or system_profiler SPDisplaysDataType. Note the OS version. Note the date the run was performed.
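
One way to make the floor automatic is a thin wrapper that snapshots everything into a per-run directory. A sketch for an NVIDIA box, with the benchmark command held in a hypothetical $CMD variable:

  RUN=bench_$(date +%Y%m%d_%H%M%S)
  mkdir -p "$RUN"
  echo "$CMD" > "$RUN/command.txt"                       # exact command line, flags included
  eval "$CMD" > "$RUN/stdout.log" 2> "$RUN/stderr.log"   # full output for sanity checks
  nvidia-smi > "$RUN/nvidia-smi.txt"                     # driver and GPU snapshot
  uname -a > "$RUN/os.txt"
  date -u > "$RUN/date.txt"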

We accept a real limitation here: most public benchmarks lack thermal-state notation, and the catalog at RunLocalAI does not require it for community submissions. We treat thermal state as best-effort metadata. If you have it, include it. If you do not, your submission still counts; it just does not clear the bar for thermal-sensitive comparisons (e.g., laptop vs. workstation runs of the same model/hardware/runtime triple).

Honesty discipline

Share the negative results. “Mistral Large 2 Q4 OOMed at 16k context on my 24 GB card” is as useful to the next operator as a number that succeeded; both spare them an afternoon. The local AI community is small enough that publishing only success cases distorts the public picture of what runs where. The catalog accepts negative submissions with the same reproducibility floor.

Flag thermal state explicitly when you have it. “GPU peak temperature 79 C; ambient 22 C; case fans at default curve” is enough. If the rig was thermally throttling during the run, say so and re-measure once cooled. If you are running a laptop, say which power profile and whether the unit was plugged in. None of these caveats reduce the credibility of your number; they protect it.
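
If you do not want to watch the sensors during the run, log the temperature in the background and take the maximum afterwards; a sketch for NVIDIA (rocm-smi --showtemp is the AMD counterpart):

  nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader -l 5 > temps.log &
  # ... run the benchmark, then stop the logger ...
  sort -n temps.log | tail -1                            # peak GPU temperature in C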

Be honest about the version metadata when you cannot fully pin it. “llama.cpp main, built from source on 2026-04-15” is better than “latest llama.cpp.” If you cannot recover the exact commit hash, say so — that is a known gap and the catalog will mark the row accordingly rather than pretending the metadata is complete.

Submitting to RunLocalAI

If you want your numbers in the catalog, the front door is /submit/benchmark. The form takes the triple (model, hardware, runtime), the measurement bundle (decode tok/s, TTFT, VRAM peak, context, quantization), and the metadata bundle (runtime version, driver, OS, command line if you have it). If your submission is a complete evaluation rather than a single triple, use /submit/evaluation instead.

The moderation queue takes one to seven days. We do not auto-publish. Submissions that pass review go live with attribution; submissions that fail review stay private and the submitter gets a one-line explanation of what was missing. The full editorial pipeline is documented at /resources/benchmark-lifecycle.

For YouTube creators

Local AI benchmarking on YouTube has a credibility problem that this checklist exists partly to solve. The format rewards big numbers and quick cuts, neither of which leaves room for runtime versions and command-line flags. The result is a glut of videos showing “X tok/s” without showing the command, the runtime version, or the model file — which means the number is unverifiable and, often, wrong.

The fix is one bullet in the description. Paste the printable checklist below; fill in the runtime version, model file name and short hash, the GPU model, and the context length. Show the command on screen for two seconds while it runs. Then publish whatever number you measured. Editorial transparency, not the headline tok/s, is what separates your video from content-mill output — viewers who care about local AI can tell the difference, and they are the viewers who subscribe.

The 10-item printable checklist

Copy-paste into your blog post, video description, or GitHub issue. The intention is that anyone who fills in all ten produces a citation-grade benchmark.

  1. Runtime version pinned (e.g. llama.cpp b3447).
  2. Driver version recorded from nvidia-smi / rocm-smi / system_profiler.
  3. Model file by full name, plus first 8 chars of sha256sum.
  4. Quantization stated explicitly (Q4_K_M, FP8, AWQ, GPTQ-Int4, etc.).
  5. Context length for the measured run.
  6. Hardware — GPU model, system RAM, CPU, cooling state at run start.
  7. Decode tok/s, separated from prefill / TTFT, median over 3-5 runs.
  8. VRAM peak + sustained tok/s under 10-minute load if measured.
  9. Exact command line with all flags preserved.
  10. Date of run + OS version + thermal notes (if any).

Embed snippet

<a href="https://runlocalai.co/resources/benchmark-methodology-checklist" rel="noopener">RunLocalAI: Benchmark Methodology Checklist</a>

License: CC-BY-4.0.

Next steps

If your run met the checklist, the catalog wants it: submit it at /submit/benchmark.