Editorial (Live coverage report)

Benchmark cohort coverage

The intelligence graph compares your benchmark to its cohort: measurements with the same model, same hardware, same quant bucket, and same context bucket. Cohorts with fewer than 5 measurements can't produce confident outlier flags. This page shows which cohorts have enough signal and which are underpowered.
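As a rough illustration of the cohort definition above, the sketch below groups measurements by the four dimensions and checks each group against the 5-measurement threshold. The record layout, field names, and the `CohortKey`, `group_into_cohorts`, and `is_underpowered` helpers are assumptions made for this example, not the site's actual schema or code; the sample rows mirror two cohorts from the table on this page.

```python
from collections import defaultdict
from typing import NamedTuple

OUTLIER_THRESHOLD = 5  # from the text: fewer than 5 measurements is underpowered


class CohortKey(NamedTuple):
    """The four dimensions that define a cohort (field names are hypothetical)."""
    model: str
    hardware: str
    quant_bucket: str
    context_bucket: str


def group_into_cohorts(measurements):
    """Group measurement dicts by model, hardware, quant bucket, and context bucket."""
    cohorts = defaultdict(list)
    for m in measurements:
        key = CohortKey(m["model"], m["hardware"], m["quant_bucket"], m["context_bucket"])
        cohorts[key].append(m)
    return dict(cohorts)


def is_underpowered(rows):
    """True when a cohort has too few measurements for confident outlier flags."""
    return len(rows) < OUTLIER_THRESHOLD


# Illustrative rows only; they carry just the cohort dimensions, no metrics.
measurements = [
    {"model": "llama-3.1-8b-instruct", "hardware": "apple-m4-max",
     "quant_bucket": "unknown", "context_bucket": "16-32K"},
    {"model": "llama-3.1-8b-instruct", "hardware": "apple-m4-max",
     "quant_bucket": "unknown", "context_bucket": "16-32K"},
    {"model": "llama-3.3-70b-instruct", "hardware": "rtx-4090",
     "quant_bucket": "4-bit", "context_bucket": "≤4K"},
]

cohorts = group_into_cohorts(measurements)
for key, rows in cohorts.items():
    print(key.model, key.hardware, len(rows), "underpowered:", is_underpowered(rows))
```

Using a tuple of the four dimensions as the dictionary key means two measurements land in the same cohort only when all four match, which is the comparison rule the paragraph describes.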

The cohorts ranked first below are the ones where one or two more measurements would do the most to raise confidence. If you have the rig, the “Reproduce” CTA on each row prefills the submission form.

Cohorts where one more measurement matters

Ranked by low or moderate confidence first, then by proximity to the 5-row outlier-detection threshold, then by recency. A new measurement on any of these cohorts moves it closer to that line.
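The ranking described above can be sketched as a three-part sort key. This is a minimal illustration, not the site's implementation: the dict fields and the `rank_key` helper are hypothetical, and treating low and moderate as a single leading group is an assumption about what "low / moderate confidence first" means. The three sample rows are taken from the table below.

```python
from datetime import date

OUTLIER_THRESHOLD = 5  # the 5-row outlier-detection threshold from the text


def rank_key(cohort):
    """Sort key: low/moderate first, then fewest rows missing, then newest."""
    # Assumption: low and moderate form one group ahead of high and very-high.
    group = 0 if cohort["confidence"] in ("low", "moderate") else 1
    # How many rows the cohort still needs to reach the threshold.
    rows_short = max(0, OUTLIER_THRESHOLD - cohort["rows"])
    # Negate the ordinal so newer dates sort first in an ascending sort.
    return (group, rows_short, -cohort["latest"].toordinal())


cohorts = [
    {"name": "llama-3.3-70b-instruct on rtx-4090 · 4-8K",
     "confidence": "low", "rows": 1, "latest": date(2026, 5, 2)},
    {"name": "qwen-2.5-coder-32b-instruct on rtx-4090 · 16-32K",
     "confidence": "low", "rows": 2, "latest": date(2026, 5, 6)},
    {"name": "llama-3.1-8b-instruct on rtx-4090 · ≤4K",
     "confidence": "low", "rows": 1, "latest": date(2026, 5, 13)},
]

ranked = sorted(cohorts, key=rank_key)
for c in ranked:
    print(c["rows"], c["latest"].isoformat(), c["name"])
```

Under these assumptions the 2-row cohort sorts ahead of both 1-row cohorts, and the 1-row cohorts are ordered newest first, which matches the order of the table on this page.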

Cohort	Confidence	Rows	Reproduced	Latest	Action
llama-3.1-8b-instruct on apple-m4-max unknown · 16-32K · Only 2 measurements; intelligence graph cannot draw conclusions. · All measurements use mlx-lm — runtime-drift signal absent until a second runtime lands.	Low	2	0	2026-05-06	Reproduce →
qwen-2.5-coder-32b-instruct on rtx-4090 4-bit · 16-32K · Only 2 measurements; intelligence graph cannot draw conclusions. · All measurements use vllm — runtime-drift signal absent until a second runtime lands.	Low	2	0	2026-05-06	Reproduce →
llama-3.3-70b-instruct on rtx-4090 4-bit · ≤4K · Single-source cohort; nothing to compare against. · All measurements use llama-cpp — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-13	Reproduce →
llama-3.3-70b-instruct on apple-m3-ultra 4-bit · ≤4K · Single-source cohort; nothing to compare against. · All measurements use mlx-lm — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-13	Reproduce →
llama-3.1-8b-instruct on rtx-3090 4-bit · ≤4K · Single-source cohort; nothing to compare against. · All measurements use llama-cpp — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-13	Reproduce →
llama-3.1-8b-instruct on rtx-5090 4-bit · ≤4K · Single-source cohort; nothing to compare against. · All measurements use llama-cpp — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-13	Reproduce →
llama-3.1-8b-instruct on apple-m3-max 4-bit · ≤4K · Single-source cohort; nothing to compare against. · All measurements use mlx-lm — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-13	Reproduce →
llama-3.1-8b-instruct on rtx-4090 4-bit · ≤4K · Single-source cohort; nothing to compare against. · All measurements use llama-cpp — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-13	Reproduce →
llama-3.1-8b-instruct on rtx-5080 4-bit · ≤4K · Single-source cohort; nothing to compare against.	Low	1	1	2026-05-11	Reproduce →
qwen-2.5-coder-7b-instruct on rtx-3080-16gb-mobile 4-bit · 4-8K · Single-source cohort; nothing to compare against. · All measurements use ollama — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-10	Reproduce →
deepseek-r1-distill-qwen-32b on rtx-4090 4-bit · 16-32K · Single-source cohort; nothing to compare against. · All measurements use vllm — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-06	Reproduce →
llama-3.1-8b-instruct on rtx-5080 4-bit · 4-8K · Single-source cohort; nothing to compare against. · All measurements use ollama — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-05	Reproduce →
llama-3.1-8b-instruct on rx-7900-xtx 4-bit · 4-8K · Single-source cohort; nothing to compare against. · All measurements use ollama — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-04	Reproduce →
phi-4-14b on rtx-4060-ti-16gb 4-bit · 4-8K · Single-source cohort; nothing to compare against. · All measurements use ollama — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-04	Reproduce →
qwen-3-32b on rtx-4090 4-bit · 16-32K · Single-source cohort; nothing to compare against. · All measurements use vllm — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-03	Reproduce →
llama-3.3-70b-instruct on rtx-4090 4-bit · 4-8K · Single-source cohort; nothing to compare against. · All measurements use ollama — runtime-drift signal absent until a second runtime lands.	Low	1	0	2026-05-02	Reproduce →

How cohort confidence is derived

Cohort labels mirror the per-benchmark confidence engine: low / moderate / high / very-high. Never percentages.

Very-high: ≥5 measurements + ≥2 reproductions.
High: ≥5 measurements, fewer than 2 reproductions.
Moderate: 3-4 measurements, below the outlier-detection threshold.
Low: 1-2 measurements, single-source. The intelligence graph cannot draw conclusions.
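
The four tiers listed above map onto a small function of the measurement and reproduction counts. The sketch below is one way to express those rules; the function name `base_tier` and its signature are made up for this example, and it covers only the counts, not any later adjustment.

```python
def base_tier(measurements: int, reproductions: int) -> str:
    """Return the confidence tier implied by the measurement and reproduction counts."""
    if measurements < 1:
        raise ValueError("a cohort needs at least one measurement")
    if measurements >= 5:
        # At or above the outlier-detection threshold: reproductions decide the tier.
        return "very-high" if reproductions >= 2 else "high"
    if measurements >= 3:
        # 3-4 measurements: below the threshold regardless of reproductions.
        return "moderate"
    # 1-2 measurements.
    return "low"


print(base_tier(5, 2))
print(base_tier(5, 0))
print(base_tier(4, 3))
print(base_tier(1, 1))
```

Note that reproductions only matter once a cohort has 5 measurements; a 1-row cohort with one reproduction still lands in the low tier, as the rtx-5080 ≤4K row in the table shows.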

A cohort last touched more than 18 months ago is demoted one tier, because runtimes and drivers have drifted since then. A cohort with only one runtime represented is called out: the runtime-drift signal is absent until a second runtime lands.
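
The two adjustments above, staleness demotion and the single-runtime callout, could look roughly like this. It is a sketch under stated assumptions: the helper names are hypothetical, "18 months" is computed in calendar months, and a cohort already at the low tier is assumed to stay there since no lower tier exists.

```python
import calendar
from datetime import date

TIERS = ["low", "moderate", "high", "very-high"]  # ordered lowest to highest
STALE_AFTER_MONTHS = 18


def months_before(today: date, months: int) -> date:
    """Return the calendar date `months` months before `today`, clamping the day."""
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(today.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def apply_staleness(tier: str, last_touched: date, today: date) -> str:
    """Demote one tier when the cohort was last touched more than 18 months ago."""
    if last_touched < months_before(today, STALE_AFTER_MONTHS):
        # Assumption: "low" has nowhere to go, so it stays "low".
        return TIERS[max(0, TIERS.index(tier) - 1)]
    return tier


def single_runtime_only(runtimes) -> bool:
    """True when every measurement in the cohort used the same runtime."""
    return len(set(runtimes)) == 1


today = date(2026, 5, 13)
print(apply_staleness("high", date(2024, 11, 12), today))  # older than the cutoff
print(apply_staleness("high", date(2026, 5, 6), today))    # recent, unchanged
print(single_runtime_only(["vllm", "vllm"]))
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.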

Next recommended step

Editorial-curated benchmark opportunities ranked by impact.

See the public benchmark roadmap

Or: Submit a benchmark · Browse benchmarks