Trust moat · Confidence engine

How RunLocalAI computes benchmark confidence

Every benchmark row carries a trust badge (Editorial, Reproduced, Community submitted) and a confidence tier (low, moderate, high, very-high). The badge captures verification state — who measured it, who reproduced it. The confidence tier captures everything else that affects how much weight you should put on the number: age, completeness, internal consistency, runtime-version drift. This page documents how the tier is computed.

By Fredoline Eruo · Last reviewed 2026-05-07

Two ladders, one trust model

RunLocalAI publishes confidence in two places that look similar but answer different questions. The four-tier row confidence ladder on this page answers “how much should I trust THIS benchmark row?” The five-tier Score-anchor confidence ladder on the Will-It-Run methodology page answers “where did the headline RunLocalAI Score come from?” The two are related — a Score reading “Measured” is anchored to a row at high or very-high row confidence — but the vocabularies are deliberately separate so readers can locate themselves in the right surface.

Row confidence (this page) → Score-anchor confidence
  • Very-high (independently reproduced) → M (Measured)
  • High (editorial-measured + recent) → M (Measured)
  • Moderate (same hw, different ctx/quant) → M~ (Measured-near) or C (Community)
  • Low (single-source, unverified) → ~ (Extrapolated) or E (Estimated)
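
Rendered as a plain mapping, the correspondence looks like this. The tier and badge names come from the table above; the dictionary itself is an illustrative sketch, not the engine's actual data structure:

```python
# Illustrative mapping from row-confidence tier to the Score-anchor label it
# can back. Names follow the table above; the structure is a sketch only.
ROW_TIER_TO_SCORE_ANCHOR = {
    "very-high": "M (Measured)",
    "high": "M (Measured)",
    "moderate": "M~ (Measured-near) or C (Community)",
    "low": "~ (Extrapolated) or E (Estimated)",
}

print(ROW_TIER_TO_SCORE_ANCHOR["high"])  # -> "M (Measured)"
```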

The four row-confidence tiers

We output exactly four tier labels per benchmark row. Adding more would imply a precision the underlying signals don’t support; collapsing to fewer would make the tier uninformative.

  • Low. Single-source, recent, plausible-but-unverified. Most fresh community submissions land here at first. Treat the number as directional information, not as a measurement you should base a hardware purchase on. A row stays at low until either the metadata fills in or another operator reproduces it.
  • Moderate. Either single-source with a strong contributor track record, or multi-source with one disagreement, or editorial-measured but stale. Worth treating as the working baseline for the configuration but with a half-step of caution baked in. Many rows live here long-term — completeness without reproduction is a moderate tier on purpose.
  • High. Editorial-measured and recent, or community-reproduced with a clean state transition from the verification policy. Operators making a buying decision should anchor on rows in this tier when possible.
  • Very-high. Independently reproduced (two or more operators), recent runtime version, complete metadata, no outlier rows in the same configuration. The strongest signal we publish.

Why we never publish percentages

The temptation to render “78% confidence” on a benchmark row is strong because the underlying engine accumulates a numeric score. We refuse it deliberately. A percentage implies a calibration we cannot honestly back — that 78% means something specific about prediction accuracy or measurement variance, when in practice the underlying inputs are heuristics. Moderate is a tier; 73 vs 76 is noise. The tier label conveys the operator-actionable signal without the false precision.

The same discipline drives the v17 catalog scoring engine — see /resources/scoring-methodology for the matching argument applied to compatibility, runtime maturity, setup complexity, and the rest of the catalog scores. These two engines together — confidence on benchmarks, scoring on catalog dimensions — document the trust math the rest of the site rests on.

The six factors

Six inputs feed the confidence engine. Each contributes a positive or negative adjustment to an internal score, which is then mapped to a tier. The factors are listed in roughly decreasing order of impact.
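
In rough pseudocode terms the pipeline has this shape. The factor signature, the cut points, and the weights used here and in the per-factor sketches below are illustrative stand-ins, not the engine's real values:

```python
# Sketch of the overall shape: each factor returns a signed adjustment, the
# adjustments are summed, and the sum is bucketed into one of the four tiers.
# The threshold values are invented for illustration.
from typing import Callable

Tier = str  # "low" | "moderate" | "high" | "very-high"

def to_tier(score: float) -> Tier:
    if score >= 8:      # hypothetical cut points
        return "very-high"
    if score >= 5:
        return "high"
    if score >= 2:
        return "moderate"
    return "low"

def row_confidence(row: dict, factors: list[Callable[[dict], float]]) -> Tier:
    return to_tier(sum(f(row) for f in factors))
```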

1. Reproduction count

Each independent reproducer adds weight. The first reproduction is the largest single factor — it’s the difference between “one operator says this” and “two operators say this,” which is conceptually a step change rather than an incremental gain. Subsequent reproductions add weight with diminishing returns; by the fourth or fifth independent reproducer the row is already at very-high and additional confirmations don’t move it.
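
A minimal sketch of that diminishing-returns shape; the specific weights are invented for illustration, only the step-change-then-plateau behavior comes from the text:

```python
# Sketch of a diminishing-returns reproduction factor: the first reproduction
# contributes the most, later ones less, and nothing accrues past the point
# where the row is already at very-high. Weights are hypothetical.
def reproduction_adjustment(n_reproducers: int) -> float:
    weights = [3.0, 1.5, 0.75, 0.5]     # first reproduction dominates
    return sum(weights[:min(n_reproducers, len(weights))])

assert reproduction_adjustment(0) == 0.0
assert reproduction_adjustment(1) == 3.0                          # the step change
assert reproduction_adjustment(6) == reproduction_adjustment(4)   # plateau
```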

2. Age

Time since the original submission for community rows, or since the recorded measurement date for editorial rows. Confidence decays gradually for the first 18 months; past 18 months a row drops a tier automatically. The runtime ecosystem moves fast enough that a 2-year-old benchmark is mostly directional regardless of how clean the original measurement was.
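
Sketched as a function; the decay slope and penalty size are illustrative, only the 18-month boundary comes from the rule above:

```python
# Sketch of the age factor: a gentle decay over the first 18 months, then a
# hard one-tier drop. The numbers are invented; the boundary is not.
def age_adjustment(age_months: float) -> tuple[float, bool]:
    """Return (score adjustment, force_tier_drop)."""
    if age_months <= 18:
        return (-1.5 * (age_months / 18), False)   # gradual decay
    return (-1.5, True)                            # stale: drop a tier outright

print(age_adjustment(2))    # fresh row, near-zero penalty
print(age_adjustment(24))   # past 18 months: automatic tier drop
```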

3. Variance across rows

When multiple rows exist for the same model + hardware + runtime configuration, the standard deviation across the rows is itself a signal. Tight clustering (all rows within ~10% of each other) lifts confidence — the configuration is stable. Wide spread (rows diverging by 30%+) drops confidence and surfaces the configuration to reviewers as something worth investigating.
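
A sketch of that clustering check, using relative spread around the median; the ~10% and 30% thresholds follow the prose, the score weights are invented:

```python
# Sketch of the variance factor for rows sharing one configuration: tight
# clustering lifts the score, wide spread costs it and flags the config.
import statistics

def variance_adjustment(decode_rates: list[float]) -> tuple[float, bool]:
    """Return (score adjustment, needs_review)."""
    if len(decode_rates) < 2:
        return (0.0, False)                 # nothing to compare against
    spread = statistics.pstdev(decode_rates) / statistics.median(decode_rates)
    if spread <= 0.10:
        return (+1.0, False)                # tight clustering: stable config
    if spread >= 0.30:
        return (-2.0, True)                 # wide spread: surface to reviewers
    return (0.0, False)

print(variance_adjustment([47.1, 48.0, 46.5]))   # tight cluster
print(variance_adjustment([47.0, 25.0, 70.0]))   # wide spread, flagged
```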

4. Runtime consistency

Each measurement is anchored to a specific runtime version. When a runtime ships a major version with kernel changes — vLLM 0.6 → 0.7 with the rewritten attention path, llama.cpp dropping a major flag, ExLlamaV2 changing its prefill scheduler — older rows on the prior version drop a tier. The engine cross-references a small editorial annotation table of “version-bump impact” entries; the annotations are written by editorial reviewers when a runtime ships something that genuinely changes inference characteristics.
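
Sketched against a stand-in annotation table; the entries, the version parsing, and the penalty size are illustrative, not the editorial table itself:

```python
# Sketch of the runtime-consistency check. VERSION_BUMPS stands in for the
# editorial annotation table of version bumps that change inference behavior.
VERSION_BUMPS = {
    "vllm": [(0, 7)],        # e.g. the rewritten attention path
}

def parse(version: str) -> tuple[int, ...]:
    return tuple(int(p) for p in version.split("."))

def runtime_drift_adjustment(runtime: str, measured_version: str) -> float:
    """Rows measured on a version older than an annotated bump drop a tier."""
    bumps = VERSION_BUMPS.get(runtime, [])
    behind = any(parse(measured_version) < bump for bump in bumps)
    return -2.0 if behind else 0.0    # hypothetical one-tier-sized penalty

print(runtime_drift_adjustment("vllm", "0.6.3"))   # predates the 0.7 bump
print(runtime_drift_adjustment("vllm", "0.7.1"))   # measured after the bump
```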

5. Missing fields

Every blank field in the discipline set costs the row points: blank runtime version, blank driver version, blank OS label, missing VRAM peak, missing TTFT. A row missing all of these never escapes low, regardless of how many reproductions it accumulates, because no reproducer can match the configuration cleanly. This is the easiest factor for a contributor to address — fill in the fields the form asks for, and the row jumps a tier on the next moderation pass.
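
A sketch of the penalty and the hard cap; the field names mirror the discipline set above, while the per-field cost is invented:

```python
# Sketch of the missing-fields penalty with the "all blank never escapes low"
# cap. Per-field cost is illustrative.
DISCIPLINE_FIELDS = ["runtime_version", "driver_version", "os_label",
                     "vram_peak", "ttft"]

def missing_fields_adjustment(row: dict) -> tuple[float, bool]:
    """Return (score adjustment, capped_at_low)."""
    missing = [f for f in DISCIPLINE_FIELDS if not row.get(f)]
    cap_at_low = len(missing) == len(DISCIPLINE_FIELDS)
    return (-1.0 * len(missing), cap_at_low)

bare = {"decode_tok_s": 52.4}              # everything in the discipline set blank
print(missing_fields_adjustment(bare))     # (-5.0, True)
```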

6. Outlier penalty

A row whose decode rate or TTFT is more than 3x the median for similar configurations gets penalized. Reviewers see the outlier flag and can either confirm the penalty (the row is suspicious and probably needs investigation) or override it (the configuration genuinely produces those numbers — usually because of an unusual driver branch or a non-default power-limit setting). The override is what protects against the engine punishing legitimately surprising results.
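
Sketched with the reviewer override as an explicit parameter; the 3x threshold follows the prose, everything else is illustrative:

```python
# Sketch of the outlier penalty: a value more than 3x (or less than 1/3 of)
# the median for similar configurations is penalized unless overridden.
import statistics

def outlier_adjustment(value: float, peer_values: list[float],
                       reviewer_override: bool = False) -> tuple[float, bool]:
    """Return (score adjustment, flagged_for_review)."""
    if not peer_values or reviewer_override:
        return (0.0, False)
    median = statistics.median(peer_values)
    is_outlier = value > 3 * median or value < median / 3
    return (-2.0, True) if is_outlier else (0.0, False)

print(outlier_adjustment(160.0, [48.2, 51.0, 47.5]))          # flagged
print(outlier_adjustment(160.0, [48.2, 51.0, 47.5], True))    # override holds
```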

Concrete examples — low vs high

Two rows, same model and hardware, opposite ends of the tier ladder:

  • Low confidence row. Llama-3.1-8B Q4_K_M on RTX 3090, single submission, runtime version blank, driver version blank, OS label blank, no notes. Reproduction count: zero. Age: two weeks (factor neutral). The missing-fields penalty alone caps this row at low; nothing the contributor can do short of filling in the metadata moves it.
  • High / very-high confidence row. Same model and hardware. Two independent reproducers, all three rows on llama.cpp 1.0.7 within 5% of each other. Driver version recorded (NVIDIA 555.42 across all three). OS label populated (Ubuntu 24.04 on two, Pop!_OS 22.04 on one — close enough). Notes describe the warmup protocol. Age: four months. This row is very-high; a buyer can anchor on it.

Most rows live somewhere between these poles. The most common mid-ladder pattern is a community-submitted row with complete metadata but no reproduction yet — that row sits at moderate until someone reproduces it.

Why we round, why we don’t average across runtimes

Two rendering disciplines that the confidence engine enforces:

We round all displayed numbers. Tok/s reports to one decimal. TTFT reports to the nearest 10ms. VRAM reports to the nearest 0.5GB. The underlying data carries more precision; the display does not. Rounding prevents the false-precision trap where 47.3 vs 47.6 looks meaningful but is below the noise floor of any real benchmark run.
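
As rounding helpers, the rule looks like this; the function names are invented, and only the granularity (one decimal, 10ms, 0.5GB) comes from the text above:

```python
# Sketch of the display-rounding rules: tok/s to one decimal, TTFT to the
# nearest 10 ms, VRAM to the nearest 0.5 GB.
def display_tok_s(value: float) -> float:
    return round(value, 1)

def display_ttft_ms(value: float) -> int:
    return int(round(value / 10) * 10)

def display_vram_gb(value: float) -> float:
    return round(value * 2) / 2

print(display_tok_s(47.34))     # 47.3
print(display_ttft_ms(183.0))   # 180
print(display_vram_gb(11.68))   # 11.5
```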

We do not average across runtimes. A model on llama.cpp and the same model on vLLM are fundamentally different configurations with different decode paths, different KV cache allocators, different prefill characteristics. Averaging the two to produce a “Llama 3.1 8B on RTX 3090” headline would be operator-hostile — the operator picking a runtime needs to see the runtime-specific row, not a meaningless midpoint. Per-runtime rows stay separate; comparison is the operator’s job, with the configuration laid out clearly. Within a single configuration, we do average the multiple runs of a single submission (median of three runs, per the reproduction guide).
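
A sketch of that aggregation boundary, with invented run data: median within a submission, separate rows per runtime, never a cross-runtime mean:

```python
# Sketch: runs collapse to a median per runtime; runtimes are never merged.
import statistics
from collections import defaultdict

runs = [
    {"runtime": "llama.cpp", "decode_tok_s": 47.1},
    {"runtime": "llama.cpp", "decode_tok_s": 48.0},
    {"runtime": "llama.cpp", "decode_tok_s": 46.5},
    {"runtime": "vllm",      "decode_tok_s": 61.2},
    {"runtime": "vllm",      "decode_tok_s": 60.4},
    {"runtime": "vllm",      "decode_tok_s": 62.0},
]

by_runtime: dict[str, list[float]] = defaultdict(list)
for run in runs:
    by_runtime[run["runtime"]].append(run["decode_tok_s"])

# One row per runtime: median of that runtime's runs, no cross-runtime average.
rows = {rt: statistics.median(vals) for rt, vals in by_runtime.items()}
print(rows)   # {'llama.cpp': 47.1, 'vllm': 61.2}
```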

When confidence drops automatically

Three triggers move a row down the tier ladder without any reviewer action:

  • 18-month staleness. The runtime ecosystem moves fast. A llama.cpp benchmark from late 2024 is mostly directional today; a vLLM benchmark from the same era is more directional still because vLLM’s kernel set has churned harder. The public detail page also starts rendering a Stale trust badge at this threshold.
  • Runtime-version drift. When the editorial annotation table records a major-version bump on a runtime, all rows on the prior version drop a tier. We don’t auto-republish at the new version unless someone re-runs the benchmark on it.
  • Single-source-only after 60+ days. A row published as community-submitted for two months without a reproduction caps at moderate regardless of how complete the metadata is. The signal “no operator has bothered to verify this” is itself worth surfacing.
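
Reduced to boolean checks, the three triggers look roughly like this; the field names and return shape are illustrative:

```python
# Sketch of the three automatic downgrades as checks over a row's metadata.
def automatic_downgrades(row: dict) -> list[str]:
    triggers = []
    if row.get("age_months", 0) > 18:
        triggers.append("stale: over 18 months old, drop a tier")
    if row.get("runtime_version_predates_annotated_bump"):
        triggers.append("runtime drift: measured on a pre-bump version")
    if row.get("reproduction_count", 0) == 0 and row.get("age_days", 0) > 60:
        triggers.append("single-source after 60+ days: capped at moderate")
    return triggers

print(automatic_downgrades({"age_months": 22, "age_days": 670,
                            "reproduction_count": 0}))
```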

How to nudge a row’s confidence upward

Three concrete, contributor-actionable paths:

  • Reproduce on different hardware. A community-submitted row with one reproducer at the same hardware tier moves to reproduced. A reproducer at a different (but compatible) tier — RTX 4090 instead of 3090 for the same llama.cpp config — adds independent signal and pushes toward independently-reproduced. Either path lifts the row at least one tier on the confidence ladder.
  • Document driver versions. A row with a fully populated driver version field anchors to a specific GPU driver branch. Reproducers on the same branch can directly support the row; reproducers on different branches generate a useful version-drift signal. Either way, the metadata field reduces the missing-fields penalty.
  • Submit at multiple context lengths. Decode-rate vs context is the most useful curve we can publish for any configuration. A submitter who provides 2K, 8K, and 32K context rows for the same model + hardware + runtime generates internal consistency the engine can use to confirm or flag outliers, and the overall configuration moves up the ladder faster than a single-context row.

The sparse-data discipline

The most important rule the engine enforces: never interpolate where data is missing.

If we don’t have a benchmark for Llama-3.1-70B on an RTX 4080, we don’t render a “predicted” tok/s by extrapolating from the 4090 result. We render the empty state honestly — a card that says “no benchmarks yet, submit one at /submit/benchmark.” The /will-it-run engine computes predicted feasibility based on VRAM math, but it never publishes a predicted decode rate as if it were a measurement.
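
A sketch of that lookup discipline; the function name and card shape are invented, and the point is the branch: a measured row or an honest empty state, never an interpolated number:

```python
# Sketch: a benchmark lookup that either returns a measured row or an explicit
# empty-state card pointing to /submit/benchmark. Values are placeholders.
def benchmark_card(rows: dict, model: str, gpu: str) -> dict:
    row = rows.get((model, gpu))
    if row is None:
        # Honest empty state: no predicted tok/s, just the submission pointer.
        return {"state": "empty", "cta": "/submit/benchmark"}
    return {"state": "measured", "decode_tok_s": row["decode_tok_s"]}

rows = {("Llama-3.1-70B", "RTX 4090"): {"decode_tok_s": 4.2}}
print(benchmark_card(rows, "Llama-3.1-70B", "RTX 4080"))   # empty, not extrapolated
```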

Empty states are honest. Interpolated numbers are not. The whole point of the trust moat is that an operator landing on a benchmark row from a search engine can trust that the number is either measured or honestly tagged otherwise. The benchmark TABLE never invents a row; the RunLocalAI Score, by contrast, will surface a bandwidth-extrapolated tok/s on the score card when no measured row exists — but always badged Extrapolated or Estimated, never as a measurement. The discipline is the same: be honest about what tier of evidence each number sits on.

Adjacent reading

This page documents the confidence engine. The verification policy documents the discrete state machine that the engine reads from. The reproduction guide documents the operator protocol that drives state transitions. And the v17 scoring methodology documents the parallel engine for catalog dimensions (compatibility, runtime maturity, setup complexity, etc.) — the same operator-language, no-marketing, tier-not-percentage discipline applied to the catalog rather than benchmarks. Together, these four pages document the trust math the rest of the catalog rests on.

See also: /editorial-policy for how editorial measurements themselves are produced, /changelog for any methodology revisions, and /submit/benchmark if you’re ready to contribute a benchmark or reproduction.