Benchmark protocol (V36.51) — how RunLocalAI captures its measured rows

Operator-grade capture protocol used on the 6 owned rigs. Designed to be reproducible by any reader with matching hardware. Built around a CLI intake script that auto-detects log format, captures rig environment, and rejects single-shot measurements.

The 6 rigs in scope

Every source=owner row on the site is measured on one of the six devices declared in /about and the author bio. Anything outside this set is routed through community contribution or labelled as estimated.

The standard prompt

Every measurement under this protocol uses the same prompt so results across rigs and across runtimes are directly comparable.

Write a detailed explanation of how transformer attention works.

Roughly 16 prompt tokens with most tokenizers. On every insert, the intake script records a sha256 prefix of the concatenated run logs so the prompt and output can be verified against the stored manifest.
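
A minimal Node sketch of that verification step; the 12-character prefix length and the newline join are assumptions for illustration, not documented script behavior:

import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Concatenate the measured run logs in order and hash the result.
// Assumption: the manifest stores the first 12 hex chars of the sha256 digest.
const joined = ["run1.log", "run2.log", "run3.log"]
  .map((p) => readFileSync(p, "utf8"))
  .join("\n");
const prefix = createHash("sha256").update(joined).digest("hex").slice(0, 12);
console.log(prefix); // compare against the manifest's stored hash prefix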

Per-tuple protocol

For each (hardware, model, quant, context) tuple the operator runs:

  1. Cold-start prep. Reboot or idle the GPU for ≥5 min. Close other GPU-using apps. On laptops: plug in, set power plan to Performance.
  2. Run 1 — warm-up, discarded. First inference is colder than subsequent ones (weights still being mapped). Not included in the median.
  3. Runs 2, 3, 4 — measurement. Three consecutive runs with the same prompt + identical CLI invocation. Each run's full output captured to its own log file.
  4. Required metrics. Decode tok/s, TTFT (time-to-first-token), prefill tok/s, and generated-token count.
  5. Reproducibility floor. Run-to-run variance ≥20% triggers a re-run; ≥3 runs with <20% variance → reproduced=true, confidence=high (see the median/variance sketch after this list).
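
A minimal sketch of that gate, assuming variance is the (max - min) spread relative to the median of the three measured decode rates (the script's exact formula may differ):

// Median and spread over the measured runs (runs 2-4; run 1 is the discarded warm-up).
function reproducibilityGate(decodeRates: number[]) {
  const sorted = [...decodeRates].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  // Hypothetical variance definition: full spread relative to the median.
  const variance = (sorted[sorted.length - 1] - sorted[0]) / median;
  const pass = decodeRates.length >= 3 && variance < 0.2; // the 20% floor
  return { median, variance, pass };
}

console.log(reproducibilityGate([52.1, 53.4, 51.8]));
// -> { median: 52.1, variance: ~0.031, pass: true } => reproduced=true, confidence=high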

Capture commands

Ollama

ollama run llama3.1:8b --verbose "Write a detailed explanation of how transformer attention works." > run1.log 2>&1
ollama run llama3.1:8b --verbose "Write a detailed explanation of how transformer attention works." > run2.log 2>&1
ollama run llama3.1:8b --verbose "Write a detailed explanation of how transformer attention works." > run3.log 2>&1
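
With --verbose, Ollama prints a summary footer on stderr (hence the 2>&1), including prompt eval rate and eval rate lines. A minimal sketch that pulls the two required rates out of a captured log; the regexes assume the stock footer wording:

import { readFileSync } from "node:fs";

// Extract "<label>: <n> tokens/s" from the --verbose footer.
// Line-anchored so "eval rate" does not also match "prompt eval rate".
function rate(log: string, label: string): number | undefined {
  const m = log.match(new RegExp(`^${label}:\\s+([\\d.]+) tokens/s`, "m"));
  return m ? parseFloat(m[1]) : undefined;
}

const log = readFileSync("run1.log", "utf8");
console.log({
  prefillTokS: rate(log, "prompt eval rate"),
  decodeTokS: rate(log, "eval rate"),
});

TTFT can be approximated from the footer's load duration plus prompt eval duration, though the script's exact derivation is not shown here.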

llama.cpp

llama-cli -m models/llama-3.1-8b-instruct.Q4_K_M.gguf \
  -p "Write a detailed explanation of how transformer attention works." \
  -n 200 -ngl 99 -c 4096 > run1.log 2>&1

Repeat twice more, writing run2.log and run3.log, to capture the three measured runs.

LM Studio uses llama.cpp underneath — capture from the Developer → Logs (Verbose) panel.
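
llama-cli prints a timing footer at exit (llama_perf_context_print on current builds, llama_print_timings on older ones) with per-phase tokens-per-second figures. A sketch that mines those lines; the substring checks are assumptions about that footer's wording:

import { readFileSync } from "node:fs";

// llama.cpp timing lines end with "( ... ms per token, <n> tokens per second)".
function tokensPerSecond(line: string): number | undefined {
  const m = line.match(/([\d.]+) tokens per second/);
  return m ? parseFloat(m[1]) : undefined;
}

let prefillTokS: number | undefined;
let decodeTokS: number | undefined;
for (const line of readFileSync("run1.log", "utf8").split("\n")) {
  if (line.includes("prompt eval time")) prefillTokS = tokensPerSecond(line);
  else if (line.includes("eval time")) decodeTokS = tokensPerSecond(line);
}
console.log({ prefillTokS, decodeTokS });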

Intake

The V36.51 CLI ingests the three logs and writes the DB row in one command. It auto-detects Ollama vs llama.cpp vs vLLM log format, captures driver, runtime, and OS versions (via nvidia-smi, ver, and ollama --version), computes the 3-run median with its min/max range, and rejects the insert if run-to-run variance exceeds the 20% floor or the hardware slug is not in the owner set.

npx tsx scripts/v36-51-benchmark-add.ts \
  --hardware rtx-5080 \
  --model llama-3.1-8b-instruct \
  --quant Q4_K_M \
  --context 4096 \
  --runtime ollama \
  --logs run1.log,run2.log,run3.log \
  --command "ollama run llama3.1:8b --verbose '...'"
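
The auto-detection step can be approximated by checking for each runtime's signature lines; a heuristic sketch (these signatures are illustrative assumptions, not the script's actual rules):

type LogFormat = "ollama" | "llama.cpp" | "vllm" | "unknown";

// Classify a raw log by runtime-specific marker strings.
function detectFormat(log: string): LogFormat {
  if (/^eval rate:\s+[\d.]+ tokens\/s/m.test(log)) return "ollama";
  if (log.includes("llama_perf_context_print") || log.includes("llama_print_timings"))
    return "llama.cpp";
  if (log.includes("Avg generation throughput")) return "vllm";
  return "unknown";
}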

The script asks for confirmation before INSERT and stores the full reproducibility manifest (driver, OS, runtime, hash, per-run detail) in the row's operator_insight_md field so the data on the public page is end-to-end traceable.
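
A plausible shape for that manifest; the field names here are illustrative, and the authoritative schema is whatever the script actually serializes into operator_insight_md:

interface ReproManifest {
  driver: string;            // from nvidia-smi
  os: string;                // from ver (or the platform equivalent)
  runtime: string;           // e.g. "ollama 0.x" from ollama --version
  command: string;           // the exact captured CLI invocation
  logSha256Prefix: string;   // hash prefix of the concatenated run logs
  runs: Array<{
    decodeTokS: number;
    ttftMs: number;
    prefillTokS: number;
    generatedTokens: number;
  }>;
  medianDecodeTokS: number;
  range: { min: number; max: number };
}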

Editorial commitment

Every row inserted under this protocol can be reproduced by any reader with matching hardware, the captured command, and a same-major-version runtime, within roughly the same range. If a reproduction lands wildly off (>30% delta), file an issue at /benchmarks/regressions or message via /contact; that is a measurement-quality bug to fix, not noise to ignore.

The freeze-state target is one Llama 3.1 8B Q4_K_M row per owned rig (6 measurements) plus one mid-tier model row per device that fits (≤24 GB → Qwen 2.5 14B Q4), plus one large-model row per device that fits (≤40 GB → Llama 3.1 70B Q4). That's a floor of ~18 rows. Beyond that, more rows are better, but the bar is "make the claim on /about provable, not aspirational."

Last reviewed: 2026-05-11. Protocol version: V36.51. Every measurement is bylined and maintained by the operator, Fredoline Eruo.