Trust at RunLocalAI
A benchmark site is only useful if you can trust the numbers. This page explains, in plain English, how we earn the right to publish one: the four-state ladder a measurement climbs, the three things we have promised never to do, and the parts of the system where we are honest about what we cannot prove.
Why trust is the product
Anyone can scrape model cards, list a few GPUs, and call the result a benchmark site. The hard part — the part that takes years and is the reason this site exists — is being a place where the number on the page is the number you would get if you ran the same configuration on the same hardware tonight. Approximately. Within known noise. With the caveats clearly listed instead of hidden in a footnote.
That is the entire product. Not the directory, not the calculators, not the comparison pages. The product is: an operator landing here from a search engine can trust the row they are reading. Everything else exists to support that one promise. The trust apparatus on this page is the contract that makes the rest of the catalog meaningful.
We document the apparatus publicly because the alternative — a proprietary “trust score” that nobody can audit — is what every competitor does. We refuse it. If our methodology cannot survive being read by a skeptical engineer, the methodology is the problem and we should fix it, not hide it.
The four-state trust ladder
Every benchmark row on the site sits at one of four states. The state is the answer to a single question: how confident are we that this number reflects reality? The states form a ladder; a row enters at the bottom and climbs as evidence accumulates.
Community submitted. An operator has submitted the measurement, an editor has reviewed it for plausibility and metadata completeness, and it has been published. One source. Treat the number as directional — useful, but not yet load-bearing for a hardware purchase decision.
Reproduced. A second operator has run the same configuration and arrived within ±15% of the original. Two sources, agreement. The measurement now has a real claim to being a measurement of the world rather than of one particular machine.
Independently reproduced. Two or more independent operators agree with the original. Three sources or more, agreement. This is the strongest community signal we publish. It is more rigorous than the headline numbers in most peer-reviewed AI papers, and we say that because it is true.
Editorial. Measured by RunLocalAI on hardware listed on the About page, using the exact protocol in the reproduction guide. The author byline is a named human; the run date is recorded; the reproducible command is published. A well-resourced reader can re-run the measurement and check our work.
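For readers who want the shape of this in code, here is a minimal sketch of the ladder and the agreement rule. The names (TrustState, agreesWithOriginal) and the tok/s framing are illustrative, not our internal schema; the only number taken from the text is the ±15% threshold.

```typescript
// Illustrative only: the four ladder states and the agreement rule that
// promotes a row from community-submitted to reproduced.
type TrustState =
  | "community-submitted"      // one source, editor-reviewed
  | "reproduced"               // a second source within ±15% of the original
  | "independently-reproduced" // three or more sources in agreement
  | "editorial";               // measured by us, named byline, published command

// A later run "agrees" with the original if it lands within ±15%.
function agreesWithOriginal(originalTokPerSec: number, laterTokPerSec: number): boolean {
  const relativeDelta = Math.abs(laterTokPerSec - originalTokPerSec) / originalTokPerSec;
  return relativeDelta <= 0.15;
}
```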
The ladder is intentionally short. A longer ladder with more intermediate rungs would imply finer discrimination than the underlying signal supports. Four states is the most we can meaningfully distinguish; collapsing to three would lose the gap between “one operator says so” and “two operators say so,” which is a step change rather than an incremental gain.
Two further signals can layer on top of any state on the ladder: Stale for measurements older than 18 months, and Verified owner for submissions from operators we have editorially reviewed as hardware owners (see the operators page for what that actually means and what it does not). Stale and verified-owner are modifiers; they do not move a row up or down the ladder by themselves.
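A sketch of the same idea for the modifiers: they annotate a row without moving it on the ladder. Field names and the month arithmetic are assumptions made for illustration; the 18-month threshold is the one stated above.

```typescript
// Illustrative only: modifiers layer on top of a TrustState; they never change it.
interface RowModifiers {
  stale: boolean;         // measurement older than 18 months
  verifiedOwner: boolean; // an editor has reviewed ownership evidence
}

const STALE_AFTER_MONTHS = 18;

function isStale(runDate: Date, now: Date = new Date()): boolean {
  const ageInMonths =
    (now.getFullYear() - runDate.getFullYear()) * 12 +
    (now.getMonth() - runDate.getMonth());
  return ageInMonths > STALE_AFTER_MONTHS;
}
```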
Three promises we have made
Three rules constrain the entire system. They were not chosen because they were convenient. They were chosen because each one rules out a specific failure mode that other benchmark sites routinely exhibit.
1. We never auto-publish
No submission becomes public without an editor reviewing it. Not rate-limited public, not tentatively public, not provisionally public — never public until a named editor has read the row and approved it. Most submissions are approved within 48 hours; some are not approved at all. Rejected submissions never appear on the site, anywhere, including in archive views or search results. The rejection is final and unrecoverable.
The discipline matters because the cheap version of a benchmark site is one where every submission auto-publishes and a moderation queue catches problems after the fact. That model produces a site where the median row is unverified. Editorial review before publication is slower; it is also the entire reason the median row on this site is something an operator can act on.
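The lifecycle the promise implies is small enough to sketch. The states and field names below are illustrative, not our submission pipeline; the point is that nothing renders until a named editor approves it, and a rejection is terminal.

```typescript
// Illustrative only: review happens before publication, not after.
type SubmissionStatus = "pending-review" | "approved" | "rejected";

interface Submission {
  status: SubmissionStatus;
  reviewedBy?: string; // the named editor, set when the decision is made
}

// The only rows the site ever renders, in any view, are approved ones.
function isPubliclyVisible(submission: Submission): boolean {
  return submission.status === "approved";
}
```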
2. We never publish percentages
The confidence engine internally accumulates a numeric score, but the score never reaches the page. We render four tier labels — low, moderate, high, very-high — and that is the entire vocabulary. A “78% confidence” pill on a benchmark row would be false precision; the underlying inputs are heuristics. Tier labels are honest about what the engine actually knows.
The same discipline applies elsewhere. Catalog scores render as tiers, not percentages. Decode rates round to one decimal. TTFT rounds to the nearest 10ms. Wherever a number could be reported to a precision that exceeds the noise floor, we reduce the precision instead. False precision is operator-hostile.
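In code form, the discipline looks like this: the numeric score never leaves the mapping function, and published numbers are rounded to the stated precision. The tier thresholds below are invented for the sketch; only the tier names, the one-decimal decode rate, and the 10ms TTFT rounding come from the text.

```typescript
// Illustrative only: the score stays internal; only the tier label reaches the page.
type ConfidenceTier = "low" | "moderate" | "high" | "very-high";

function toTier(score: number): ConfidenceTier {
  if (score >= 0.85) return "very-high";
  if (score >= 0.6)  return "high";
  if (score >= 0.35) return "moderate";
  return "low";
}

// Published precision: decode rate to one decimal place, TTFT to the nearest 10 ms.
const roundDecodeRate = (tokPerSec: number): number => Math.round(tokPerSec * 10) / 10;
const roundTtftMs = (ttftMs: number): number => Math.round(ttftMs / 10) * 10;
```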
3. We never invent numbers on sparse data
If we do not have a benchmark for a given model on a given GPU, we do not interpolate. We do not extrapolate from the next-larger card. We do not derive a tok/s from the model card and the memory-bandwidth spec. We render the empty state honestly — a card that says “no benchmarks yet, submit one” with a link to the submission form.
The will-it-run engine computes predicted feasibility based on VRAM math. That is a different thing from a predicted decode rate, and the engine is careful to distinguish them. A feasibility prediction says “this configuration should fit; whether it runs well is a separate question.” A predicted decode rate would be a manufactured number, and a manufactured number is exactly the failure we built the trust apparatus to prevent.
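A sketch of the distinction, with made-up field names and a placeholder overhead constant rather than our actual VRAM math: the function answers "should it fit?", and deliberately has no way to answer "how fast will it run?".

```typescript
// Illustrative only: feasibility is a yes/no claim about memory, not a speed claim.
interface WillItRunInput {
  modelWeightsGb: number; // quantized weights
  kvCacheGb: number;      // estimated KV cache at the chosen context length
  gpuVramGb: number;
}

function predictedFeasibility(input: WillItRunInput): "should-fit" | "will-not-fit" {
  const runtimeOverheadGb = 1.5; // placeholder for buffers, CUDA context, slack
  const requiredGb = input.modelWeightsGb + input.kvCacheGb + runtimeOverheadGb;
  // Deliberately no tok/s in the return type: "fits" and "runs well" are
  // different claims, and feasibility only has evidence for the first.
  return requiredGb <= input.gpuVramGb ? "should-fit" : "will-not-fit";
}
```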
What we cannot prove
Three places where we are honest about the limits of our verification. The first defense against being misleading is being precise about what we are not claiming.
Hardware ownership. When a community submission carries the verified-owner modifier, we mean an editor has reviewed evidence — build photos, prior contributions, public posts about the hardware — and concluded the operator probably owns the machine they claim to own. We cannot prove it. There is no cryptographic attestation chain for a workstation under a desk. Verified-owner is editorial judgment, not a credential, and the operators page documents exactly what evidence we look for and where the limits are.
Reproducibility at scale. Editorial benchmarks ship a reproducible command. We cannot guarantee that every reader running the command on their own hardware will get the same number. Hardware varies, drivers vary, OS configurations vary, thermal headroom varies. We can guarantee that the command we published is what we ran, that the hardware in the byline is the hardware we ran it on, and that the number we recorded is the number we measured. Beyond that, the reproduction guide documents the protocol; the rest is the world being noisy.
Long-term accuracy. Runtime ecosystems move fast. A benchmark from twelve months ago on llama.cpp 1.0.4 is not a benchmark on whatever llama.cpp is shipping today. We retire very-stale benchmarks with explicit stale signals; we do not silently update old rows with new runtime behavior. Every benchmark on the site is a measurement at a moment in time, and the trust apparatus exists in part to make that moment legible — so a reader can decide for themselves whether the moment is still relevant to their decision.
Deeper reads
Three pages drill further into specific parts of the apparatus. Each is the operator-actionable detail behind one of the sub-systems summarized above.
- /trust/benchmarks — the confidence engine factors, rejection criteria, reproduction methodology, and stale-data retirement.
- /trust/editorial — the editorial review process, conflict-of-interest discipline, scoring methodology, and how editorial accountability is logged internally.
- /trust/operators — what verified-owner means in practice, the evidence we accept, the identity information we never require, the hideContributor flag, and how reputation is editorially earned rather than algorithmically scored.
Where to go next
Start with /trust/benchmarks — the confidence engine factors, rejection criteria, reproduction methodology, and stale-data retirement. It is the most operator-actionable of the three child pages.