Local evaluation lab

Run reproducible benchmarks on local models: lm-evaluation-harness + bigcode-evaluation-harness + custom task runners + a Postgres results store + Grafana for tracking. The setup that turns 'this model feels smarter' into 'this model is +3.2 on HumanEval+'.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,700 words

Build summary

Hardware footprint
RTX 4090 OR 2× RTX 3090 · 64 GB RAM · 1 TB NVMe
Concurrency
1 active eval run; multiple queued.
Power
~400-450 W during eval; idle between runs.

Goal: Evaluate model + quant + runtime combinations against standard and custom benchmarks reproducibly.
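
Written out as a run matrix, that goal looks like this. A minimal sketch; the model, quant, runtime, and task names are placeholders to swap for your own candidates:

```python
# A minimal sketch of the run matrix this lab exists to cover.
# Model names, quant labels, and runtime tags are illustrative placeholders.
from itertools import product

MODELS   = ["llama-3.1-8b-instruct", "qwen2.5-32b-instruct"]
QUANTS   = ["fp16", "awq-int4", "gptq-int4"]
RUNTIMES = ["vllm-0.6.3"]          # pin one runtime version per campaign
TASKS    = ["mmlu", "gsm8k", "humanevalplus"]

def run_configs():
    """Yield one fully-pinned config per (model, quant, runtime) cell."""
    for model, quant, runtime in product(MODELS, QUANTS, RUNTIMES):
        yield {
            "model": model,
            "quant": quant,
            "runtime": runtime,
            "tasks": TASKS,
        }

for cfg in run_configs():
    print(cfg)  # in practice: enqueue into the CI runner, one job per cell
```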

Operator card

Workflow
Best for
  • ✓Researchers comparing model + quant + runtime combinations
  • ✓Teams choosing between open-weights candidates
  • ✓Anyone fine-tuning who needs before/after measurements
  • ✓Authors of model lineage / benchmark articles
Avoid if
  • ⚠You only need one-off vibes-check evals
  • ⚠You don't have a dedicated GPU for the lab
  • ⚠You're not willing to pin every version (rigorous eval demands version discipline)
Stability
evolving
Maintenance
Weekly attention
Skill
Advanced
Long-session reliability
reliable

Service ledger

6 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
vLLM
Inference
8000/tcp
Inference engine. Continuous batching makes harness runs ~3-5× faster than single-stream inference. Reproducibility is solid (deterministic seeds work).
Runs: Docker, GPU 0
Surface
lm-evaluation-harness (EleutherAI)
Router / orchestrator
Eval orchestrator. Industry-standard harness. Covers MMLU, ARC, HellaSwag, GSM8K, TruthfulQA, and most LLM benchmarks. Plug-in tasks for custom metrics.
Runs: Python venv on host
bigcode-evaluation-harness
Router / orchestrator
Coding-eval orchestrator. HumanEval+, MBPP+, MultiPL-E. Coding-specific harness; runs alongside lm-eval.
Runs: Python venv on host
Data
Postgres
Storage
5432/tcp (loopback)
Results store. Each eval run gets a row with model + quant + runtime + commit + scores. Postgres is the right shape for the time-series + query pattern.
Runs: Docker container
GitHub Actions self-hosted runner OR Buildkite agent
Queue
Eval queue. Eval runs are CI-shaped: deterministic, idempotent, and they want artifacts persisted. A self-hosted runner lets you gate jobs on your GPU's availability.
Runs: host service
Operations
Grafana (Postgres datasource)
Observability
3000/tcp
Results visualization. Per-model scoreboards, regression detection, model lineage charts. Better than spreadsheets for tracking weeks of runs.
Runs: Docker container
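
How the compute and surface layers meet in practice: a minimal sketch of one run through the harness's Python entry point, assuming lm-evaluation-harness ≥ 0.4 with its vLLM backend installed (pip install "lm_eval[vllm]"). The model ID and task are illustrative.

```python
import json
import lm_eval

# One pinned eval run: vLLM backend, fixed seed, per-sample logging kept
# for regression debugging. Exact kwargs vary slightly across harness
# versions -- check lm_eval.simple_evaluate in your pinned commit.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=auto,seed=1234",
    tasks=["gsm8k"],
    batch_size="auto",
    log_samples=True,
)

print(json.dumps(results["results"], indent=2, default=str))
```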

Hardware

Single 4090 covers 7B-32B model evaluation. Dual 3090 with NVLink lets you eval 70B-class models without renting cloud time.

Reserve one GPU for evals only. Sharing it with chat / coding workloads invalidates throughput measurements; sharing it with another agent can even shift accuracy if KV-cache eviction behaves differently under contention.

Power matters: throttling silently degrades scores. Run evals on a dedicated PSU rail; monitor via DCGM during runs.
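
DCGM is the full-fat answer; a lighter stand-in is polling NVML from Python during the run and flagging anything that would invalidate it. Assumes the nvidia-ml-py (pynvml) package; GPU index 0 is the dedicated eval card.

```python
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # dedicated eval GPU
THERMAL_LIMIT_C = 82

# Run this alongside the eval; any hit means the scores are suspect.
while True:
    temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
    reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(gpu)
    thermal = bool(reasons & pynvml.nvmlClocksThrottleReasonSwThermalSlowdown)
    if temp > THERMAL_LIMIT_C or thermal:
        print(f"WARN temp={temp}C thermal_throttle={thermal} -> run may be invalid")
    time.sleep(10)
```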

Storage

Each eval run produces ~50 MB raw outputs (per-task model generations). Keep them — regression debugging needs raw output, not just aggregate scores.

Per-model: ~10-20 GB weights + ~500 MB rolling output history. Postgres results tracking adds <100 MB / year.

Back up the Postgres results DB separately — that's the irreplaceable artifact. Per-run outputs can be regenerated.
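
One table is enough. A sketch of the results schema, assuming psycopg2 and the loopback Postgres from the service ledger; column names are illustrative, but every pinning field the Security section asks for is here.

```python
import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_runs (
    id              BIGSERIAL PRIMARY KEY,
    run_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
    model           TEXT NOT NULL,
    model_sha       TEXT NOT NULL,   -- weights revision actually loaded
    quant           TEXT NOT NULL,
    runtime         TEXT NOT NULL,
    runtime_version TEXT NOT NULL,   -- e.g. pinned vLLM version
    harness_commit  TEXT NOT NULL,   -- lm-eval / custom-tasks git SHA
    task            TEXT NOT NULL,
    score           DOUBLE PRECISION NOT NULL,
    raw_output_path TEXT             -- where the per-sample dump lives
);
"""

with psycopg2.connect("dbname=evals host=127.0.0.1 port=5432") as conn:
    with conn.cursor() as cur:
        cur.execute(SCHEMA)
```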

Networking

Eval lab is offline-friendly. Most harnesses pre-download datasets; once downloaded, runs are local-only.

If you publish eval results: a thin static-site renderer (Next.js / Hugo) reads from Postgres and emits a leaderboard page. Internal-only is fine for solo research.
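
Thinner still than Next.js / Hugo: a stand-in renderer that reads the eval_runs table sketched under Storage and emits one static HTML leaderboard page.

```python
import html
import psycopg2

QUERY = """
SELECT model, quant, task, max(score) AS best
FROM eval_runs
GROUP BY model, quant, task
ORDER BY task, best DESC;
"""

with psycopg2.connect("dbname=evals host=127.0.0.1") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        rows = cur.fetchall()

body = "".join(
    f"<tr><td>{html.escape(m)}</td><td>{html.escape(q)}</td>"
    f"<td>{html.escape(t)}</td><td>{s:.3f}</td></tr>"
    for m, q, t, s in rows
)
with open("leaderboard.html", "w") as f:
    f.write("<table><tr><th>Model</th><th>Quant</th><th>Task</th>"
            f"<th>Best</th></tr>{body}</table>")
```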

Observability

Critical metrics during a run:

  • Eval throughput (samples/min). Drops indicate VRAM pressure or thermal throttle.
  • GPU temp during the run. >82 °C sustained → throttling possible → invalid run.
  • Sampling determinism check. Re-run a few canonical prompts at run start and compare the outputs to the previous run's baseline (sketched after this list). Drift = something changed (driver, runtime, model).
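
A minimal version of that determinism canary, assuming vLLM's standard OpenAI-compatible /v1/completions endpoint. The prompts, model name, and baseline file are placeholders; per-request seed support varies by server version, so temperature 0 is the main lever here.

```python
import hashlib, json
import requests

PROMPTS = ["The capital of France is", "def fib(n):"]
BASELINE = "determinism_baseline.json"

def fingerprint() -> str:
    outs = []
    for p in PROMPTS:
        r = requests.post("http://127.0.0.1:8000/v1/completions", json={
            "model": "served-model-name",  # placeholder: whatever vLLM serves
            "prompt": p,
            "max_tokens": 32,
            "temperature": 0,
        })
        outs.append(r.json()["choices"][0]["text"])
    return hashlib.sha256(json.dumps(outs).encode()).hexdigest()

current = fingerprint()
try:
    if json.load(open(BASELINE))["sha256"] != current:
        raise SystemExit("DRIFT: driver, runtime, or model changed")
except FileNotFoundError:
    pass  # first run: no baseline yet
json.dump({"sha256": current}, open(BASELINE, "w"))
```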

Post-run:

  • Score regression detection. Grafana alert when a new run scores >2σ below its model's historical mean (query sketched below).
  • Reproducibility window. Same model + same harness commit + same vLLM commit should match within ~0.5%. Wider variance = something is non-deterministic.
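
The 2σ check is one SQL query against eval_runs (schema under Storage); Grafana can alert on it directly, or you can run it post-hoc. stddev_samp is stock Postgres. Note the new run is included in its own history here, which is fine for a sketch.

```python
import psycopg2

CHECK = """
WITH hist AS (
    SELECT model, task,
           avg(score)         AS mu,
           stddev_samp(score) AS sigma
    FROM eval_runs
    GROUP BY model, task
)
SELECT r.id, r.model, r.task, r.score, h.mu, h.sigma
FROM eval_runs r
JOIN hist h USING (model, task)
WHERE h.sigma > 0
  AND r.score < h.mu - 2 * h.sigma
ORDER BY r.run_at DESC;
"""

with psycopg2.connect("dbname=evals host=127.0.0.1") as conn:
    with conn.cursor() as cur:
        cur.execute(CHECK)
        for row in cur.fetchall():
            print("REGRESSION:", row)
```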

Security

Dataset contamination. Many public benchmarks have leaked into training data. lm-eval ships with leakage-aware variants when available; use them.

Custom eval data. If you write proprietary eval suites, treat the dataset like source code — don't paste into ChatGPT, don't train on it accidentally.

Reproducibility audit trail. Every result row should reference a Git commit (harness + your custom tasks) + model SHA + runtime version. This matters when a paper reviewer asks.
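
Capturing the stamp at run start is a few lines. A sketch, assuming the harness checkout is a git repo and the runtime is an importable package; the audit_stamp helper is illustrative, and its output belongs in the eval_runs row.

```python
import subprocess
from importlib.metadata import version

def audit_stamp(harness_dir: str) -> dict:
    harness_commit = subprocess.check_output(
        ["git", "-C", harness_dir, "rev-parse", "HEAD"], text=True
    ).strip()
    return {
        "harness_commit": harness_commit,
        "runtime": "vllm",
        "runtime_version": version("vllm"),
        # model_sha: also record the weights revision you actually loaded
        # (e.g. the Hugging Face commit hash pinned at download time).
    }

print(audit_stamp("."))
```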

Upgrade path

More benchmarks: add MT-Bench, AlpacaEval-2 (LLM-as-judge), SWE-Bench. Each has different runtime profiles; budget per-benchmark hardware time.

Multi-model parallel eval: add a second GPU and run lm-eval-harness on both with different models. Doubles throughput, at the risk of cross-contamination if you mistakenly share state between runs.

Production-grade: move from Postgres-on-Docker to a managed Postgres (or local HA), add a webhook system to fire CI on new model release, automate the leaderboard page generation.

Custom harness tasks: write task definitions for your domain — coding-style, factuality-on-internal-docs, instruction-following on your prompt library.
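
If a task doesn't fit lm-eval's plug-in model, a standalone runner is small. A sketch: JSONL cases in, vLLM's OpenAI-compatible endpoint, a domain checker out. The file name, model name, and the exact-match checker are all placeholders.

```python
import json
import requests

def check(expected: str, got: str) -> bool:
    # Swap in your domain metric; exact match keeps the sketch honest.
    return expected.strip() == got.strip()

def run_task(path: str) -> float:
    hits = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)  # {"prompt": ..., "expected": ...}
            r = requests.post("http://127.0.0.1:8000/v1/completions", json={
                "model": "served-model-name",  # placeholder
                "prompt": case["prompt"],
                "max_tokens": 128,
                "temperature": 0,
            })
            hits += check(case["expected"], r.json()["choices"][0]["text"])
            total += 1
    return hits / total

print(f"accuracy: {run_task('internal_factuality.jsonl'):.3f}")
```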

What breaks first

  1. Driver / runtime drift mid-suite. Evals taking days span auto-update windows. Pin everything; never run NVIDIA driver upgrades during a multi-day eval campaign.
  2. Sampling non-determinism. Different vLLM versions sample differently with the same seed. Tag every run with the runtime SHA.
  3. Dataset version drift. HF datasets occasionally update; cached versions may differ from latest. Pin dataset revisions.
  4. Disk fill from raw outputs. A multi-task eval can drop 5 GB of model generations. Set up rotation (sketch after this list).
  5. Postgres integrity. Don't run evals against the same Postgres that stores production data. One bug in a custom harness can corrupt the results table.
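
The rotation from item 4, sketched: keep the newest N run directories per model, prune the rest. The outputs/<model>/<run_id>/ layout is illustrative; the Postgres rows and their scores stay untouched.

```python
import shutil
from pathlib import Path

KEEP = 10            # rolling history depth per model
ROOT = Path("outputs")

for model_dir in ROOT.iterdir():
    if not model_dir.is_dir():
        continue
    runs = sorted(model_dir.iterdir(), key=lambda p: p.stat().st_mtime)
    for stale in runs[:-KEEP]:
        shutil.rmtree(stale)  # scores live in Postgres; only raw dumps go
```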

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

/stacks/rtx-4090-workstation →

