RUNLOCALAI v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

✓ Editorial (Live coverage report)

Benchmark cohort coverage

The intelligence graph compares your benchmark to its cohort: same model, same hardware, same quant bucket, same context bucket. Cohorts with fewer than 5 measurements can't produce confident outlier flags. This page shows which cohorts have signal and which are underpowered.
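A minimal sketch of how measurements could be grouped into cohorts by the four-part key described above. The field names (`model`, `hardware`, `quant`, `context_tokens`), the bucket edges, and the `8-16K` and `>32K` labels are assumptions for illustration; only `≤4K`, `4-8K`, and `16-32K` appear on this page.

```python
from collections import defaultdict
from typing import NamedTuple

# From the text: cohorts with fewer than 5 measurements are underpowered.
OUTLIER_THRESHOLD = 5


class CohortKey(NamedTuple):
    model: str
    hardware: str
    quant_bucket: str
    context_bucket: str


def context_bucket(context_tokens: int) -> str:
    """Map a context length to a bucket label (edges are assumed)."""
    if context_tokens <= 4096:
        return "≤4K"
    if context_tokens <= 8192:
        return "4-8K"
    if context_tokens <= 16384:
        return "8-16K"   # hypothetical label, not shown on this page
    if context_tokens <= 32768:
        return "16-32K"
    return ">32K"        # hypothetical label, not shown on this page


def group_into_cohorts(rows):
    """Group measurement dicts by (model, hardware, quant, context bucket)."""
    cohorts = defaultdict(list)
    for row in rows:
        key = CohortKey(
            row["model"],
            row["hardware"],
            row["quant"],
            context_bucket(row["context_tokens"]),
        )
        cohorts[key].append(row)
    return dict(cohorts)


def underpowered(cohorts):
    """Return the keys of cohorts below the outlier-detection threshold."""
    return [key for key, rows in cohorts.items() if len(rows) < OUTLIER_THRESHOLD]
```

Under these assumptions, two measurements that differ only in context length (say 2,048 and 4,096 tokens) fall into the same `≤4K` cohort, while an 8,192-token run of the same model and hardware starts a separate `4-8K` cohort.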

The cohorts ranked first below are ones where one or two more measurements would unlock real intelligence. If you have the rig, the “reproduce” CTA on each row prefills the submission form.

Total cohorts: 16
Very-high tier: 0
Underpowered: 16
Single-runtime only: 16

Cohorts where one more measurement matters

Ranked: low / moderate confidence first, then proximity to the 5-row outlier-detection threshold, then recency. Each new measurement on one of these cohorts moves it closer to that threshold.

Cohort · Confidence · Rows · Reproduced · Latest · Action

llama-3.1-8b-instruct on apple-m4-max
unknown · 16-32K
  • Only 2 measurements; intelligence graph cannot draw conclusions.
  • All measurements use mlx-lm; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 2 · Reproduced: 0 · Latest: 2026-05-06 · Reproduce →

qwen-2.5-coder-32b-instruct on rtx-4090
4-bit · 16-32K
  • Only 2 measurements; intelligence graph cannot draw conclusions.
  • All measurements use vllm; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 2 · Reproduced: 0 · Latest: 2026-05-06 · Reproduce →

llama-3.3-70b-instruct on rtx-4090
4-bit · ≤4K
  • Single-source cohort; nothing to compare against.
  • All measurements use llama-cpp; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-13 · Reproduce →

llama-3.3-70b-instruct on apple-m3-ultra
4-bit · ≤4K
  • Single-source cohort; nothing to compare against.
  • All measurements use mlx-lm; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-13 · Reproduce →

llama-3.1-8b-instruct on rtx-3090
4-bit · ≤4K
  • Single-source cohort; nothing to compare against.
  • All measurements use llama-cpp; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-13 · Reproduce →

llama-3.1-8b-instruct on rtx-5090
4-bit · ≤4K
  • Single-source cohort; nothing to compare against.
  • All measurements use llama-cpp; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-13 · Reproduce →

llama-3.1-8b-instruct on apple-m3-max
4-bit · ≤4K
  • Single-source cohort; nothing to compare against.
  • All measurements use mlx-lm; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-13 · Reproduce →

llama-3.1-8b-instruct on rtx-4090
4-bit · ≤4K
  • Single-source cohort; nothing to compare against.
  • All measurements use llama-cpp; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-13 · Reproduce →

llama-3.1-8b-instruct on rtx-5080
4-bit · ≤4K
  • Single-source cohort; nothing to compare against.
Confidence: Low · Rows: 1 · Reproduced: 1 · Latest: 2026-05-11 · Reproduce →

qwen-2.5-coder-7b-instruct on rtx-3080-16gb-mobile
4-bit · 4-8K
  • Single-source cohort; nothing to compare against.
  • All measurements use ollama; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-10 · Reproduce →

deepseek-r1-distill-qwen-32b on rtx-4090
4-bit · 16-32K
  • Single-source cohort; nothing to compare against.
  • All measurements use vllm; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-06 · Reproduce →

llama-3.1-8b-instruct on rtx-5080
4-bit · 4-8K
  • Single-source cohort; nothing to compare against.
  • All measurements use ollama; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-05 · Reproduce →

llama-3.1-8b-instruct on rx-7900-xtx
4-bit · 4-8K
  • Single-source cohort; nothing to compare against.
  • All measurements use ollama; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-04 · Reproduce →

phi-4-14b on rtx-4060-ti-16gb
4-bit · 4-8K
  • Single-source cohort; nothing to compare against.
  • All measurements use ollama; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-04 · Reproduce →

qwen-3-32b on rtx-4090
4-bit · 16-32K
  • Single-source cohort; nothing to compare against.
  • All measurements use vllm; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-03 · Reproduce →

llama-3.3-70b-instruct on rtx-4090
4-bit · 4-8K
  • Single-source cohort; nothing to compare against.
  • All measurements use ollama; runtime-drift signal absent until a second runtime lands.
Confidence: Low · Rows: 1 · Reproduced: 0 · Latest: 2026-05-02 · Reproduce →
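
The ordering of the list above (low / moderate confidence first, then proximity to the 5-row threshold, then recency) can be sketched as a three-part sort key. The `Cohort` dataclass and its field names are hypothetical; this is an illustration of the stated ranking, not the site's actual implementation.

```python
from dataclasses import dataclass
from datetime import date

OUTLIER_THRESHOLD = 5  # rows needed before outlier detection is possible


@dataclass
class Cohort:
    name: str          # hypothetical display label
    confidence: str    # "low" | "moderate" | "high" | "very-high"
    rows: int
    latest: date


def rank_key(cohort: Cohort):
    # 1. Low / moderate cohorts sort ahead of high / very-high ones.
    tier_group = 0 if cohort.confidence in ("low", "moderate") else 1
    # 2. Fewer rows missing from the threshold sorts first.
    gap = max(0, OUTLIER_THRESHOLD - cohort.rows)
    # 3. More recent measurements sort first (negated for ascending sort).
    return (tier_group, gap, -cohort.latest.toordinal())


def rank_cohorts(cohorts):
    return sorted(cohorts, key=rank_key)


# Three rows taken from the list above.
sample = [
    Cohort("llama-3.3-70b-instruct on rtx-4090 (4-8K)", "low", 1, date(2026, 5, 2)),
    Cohort("llama-3.1-8b-instruct on apple-m4-max", "low", 2, date(2026, 5, 6)),
    Cohort("llama-3.3-70b-instruct on rtx-4090 (≤4K)", "low", 1, date(2026, 5, 13)),
]
ranked = rank_cohorts(sample)
```

With this key, the 2-row cohort comes first because it is closest to the threshold, and the two 1-row cohorts follow in newest-first order, which matches the order in which they appear above.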

How cohort confidence is derived

Cohort labels mirror the per-benchmark confidence engine: low / moderate / high / very-high. Never percentages.

  • Very-high: ≥5 measurements + ≥2 reproductions.
  • High: ≥5 measurements, fewer than 2 reproductions.
  • Moderate: 3-4 measurements, below the outlier-detection threshold.
  • Low: 1-2 measurements, single-source. The intelligence graph cannot draw conclusions.

A cohort last touched more than 18 months ago gets demoted one tier, because runtimes and drivers drift over that span. A cohort with only one runtime represented gets called out; runtime-drift signal is absent until a second runtime lands.
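
The tier rules and the staleness demotion described above can be sketched as a small function. The function names, the argument names, and the 548-day approximation of "18 months" are assumptions; the page does not state how the age cutoff is computed.

```python
from datetime import date

TIERS = ["low", "moderate", "high", "very-high"]
STALE_AFTER_DAYS = 548  # roughly 18 months; the exact cutoff rule is assumed


def cohort_confidence(rows: int, reproductions: int, latest: date, today: date) -> str:
    """Return the confidence label for a cohort under the stated rules."""
    if rows >= 5 and reproductions >= 2:
        tier = 3  # very-high
    elif rows >= 5:
        tier = 2  # high
    elif rows >= 3:
        tier = 1  # moderate
    else:
        tier = 0  # low
    # Demote one tier when the cohort has not been touched recently.
    if (today - latest).days > STALE_AFTER_DAYS:
        tier = max(0, tier - 1)
    return TIERS[tier]


def is_single_runtime(runtimes) -> bool:
    """True when every measurement in the cohort used the same runtime."""
    return len(set(runtimes)) == 1
```

For example, under these assumptions a cohort with 2 measurements and no reproductions is `low` regardless of age, while a cohort with 5 measurements and 2 reproductions drops from `very-high` to `high` once its latest measurement is older than the cutoff.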

Next recommended step

Editorial-curated benchmark opportunities ranked by impact.

See the public benchmark roadmap
Or: Submit a benchmark · Browse benchmarks