Reproducible eval scores
Local AI evaluations
Community-submitted lm-evaluation-harness scores for local models run on local hardware. This is distinct from /benchmarks, which covers throughput (tok/s) and VRAM usage. Reproducibility comes from pinning three things: the harness commit, the runtime version, and the exact command line used for the run.
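A reproducible submission might pin all three like this. This is a sketch, not a required format: the commit SHA and model path are placeholders you would replace with the values from your own run.

```shell
# Pin the harness to the exact commit used for the run
# (<commit-sha> is a placeholder — substitute your actual commit)
pip install "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness@<commit-sha>"

# Record the runtime version alongside the score
python --version

# The exact command line that produced the score
# (model path and task choice are illustrative)
lm_eval --model hf \
  --model_args pretrained=/path/to/local-model \
  --tasks hellaswag \
  --batch_size 8
```

With those three values recorded, anyone with the same hardware can rerun the identical command against the identical harness commit and compare scores directly.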
Got an eval to share? Submit it: moderation takes 1–7 days, and we never auto-publish.
No public evaluations yet. Be the first to submit one via /submit/evaluation.