
Workflow validation methodology

A workflow on RunLocalAI is the layer above stacks — an end-to-end pattern like “offline RAG over your own documents,” “agent that runs shell commands in a sandbox,” or “voice transcription pipeline.” Each workflow names the components, declares the hardware envelope, and points to benchmarks where they exist. This page explains what “validated” means in that context and, equally important, what it does not mean.

Editorial · Methodology · Operator-reviewed
By Fredoline Eruo · Last reviewed 2026-05-08

How workflows are tied to benchmarks

Every workflow declares its component shape in the registry at src/lib/workflows/registry.ts: the model class, the runtime, the optional retrieval or memory layer, the hardware envelope. That declaration is what we use to look up benchmarks relevant to the workflow. The lookup is structural — we don’t demand an exact model + hardware match; we widen progressively from “exact configuration” through “same model size class on similar hardware tier” until we have enough rows to render a hardware envelope, or we explicitly note that no rows match.
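
A minimal sketch of that widening lookup, in TypeScript to match the registry’s language. Every name below (WorkflowShape, BenchmarkRow, lookupBenchmarks, the tier strings) is a hypothetical illustration of the two-pass widening, not the actual schema in src/lib/workflows/registry.ts:

```ts
// Hypothetical types; illustrative only, not the registry's real schema.
type HardwareTier = "apple-silicon" | "consumer-gpu" | "multi-gpu" | "datacenter";

interface WorkflowShape {
  modelClass: string;       // e.g. "7B-instruct"
  runtime: string;          // e.g. "llama.cpp"
  retrievalLayer?: string;  // optional retrieval or memory layer
  envelope: HardwareTier[]; // declared hardware envelope
}

interface BenchmarkRow {
  modelClass: string;
  runtime: string;
  tier: HardwareTier;
  decodeToksPerSec: number;
}

// Widen progressively: exact configuration first, then the same model
// size class on any tier inside the declared envelope. Stop as soon as
// a pass yields enough rows to render a hardware envelope.
function lookupBenchmarks(
  shape: WorkflowShape,
  rows: BenchmarkRow[],
  minRows = 3,
): BenchmarkRow[] | null {
  const passes: Array<(r: BenchmarkRow) => boolean> = [
    (r) => r.modelClass === shape.modelClass && r.runtime === shape.runtime,
    (r) => r.modelClass === shape.modelClass && shape.envelope.includes(r.tier),
  ];
  for (const pass of passes) {
    const hits = rows.filter(pass);
    if (hits.length >= minRows) return hits;
  }
  return null; // no match: the caller renders the empty state, not a guess
}
```

The property that matters is the explicit null at the end: when no pass produces enough rows, the page shows the empty state rather than a synthesized number.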

On the workflow detail page this surfaces as the “Featured in benchmarks” strip, the hardware envelope panel, and the per-stack performance hint cards. A workflow with strong benchmark linkage shows real numbers; a workflow without it shows the empty-state honestly — same discipline as the rest of the site.

What counts as “validating” a workflow

Three signals can validate a workflow, and we render each differently.

  • Editorial validation. The workflow has been built and run by editorial on at least one hardware tier within the declared envelope, with notes on what worked, what didn’t, and what the failure modes were. This is the strongest signal we publish for a workflow because it’s the one we can speak directly to. It carries the editorial trust pill.
  • Operator validation. Independent operators have submitted benchmarks on the same component shape, the configuration matches the workflow’s declared envelope, and the numbers are inside the expected range for the shape. This is the most common path for workflows in production usage on hardware editorial doesn’t own. It carries the operator-reviewed or reproduced pill depending on count.
  • Stack inheritance. The workflow is composed of stacks that have themselves been built and benchmarked. The workflow inherits the trust signal of its weakest component — if two of its three component stacks are editorially built and the third is community-submitted, the workflow surfaces the community-submitted level. This is conservative on purpose; a sketch of the inheritance rule follows this list.
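
A minimal sketch of that inheritance rule. The total ordering of trust levels below is an assumption for illustration (in particular, the relative rank of “reproduced” and “editorial” is not stated on this page):

```ts
// Hypothetical ordering, weakest first; illustrative only.
const TRUST_ORDER = [
  "unverified",
  "community-submitted",
  "operator-reviewed",
  "reproduced",
  "editorial",
] as const;

type TrustLevel = (typeof TRUST_ORDER)[number];

// A workflow surfaces the trust level of its weakest component stack.
function inheritedTrust(stackLevels: TrustLevel[]): TrustLevel {
  if (stackLevels.length === 0) return "unverified";
  return stackLevels.reduce((weakest, level) =>
    TRUST_ORDER.indexOf(level) < TRUST_ORDER.indexOf(weakest) ? level : weakest,
  );
}

// Two editorially built stacks plus one community-submitted stack
// surface the community-submitted level.
inheritedTrust(["editorial", "editorial", "community-submitted"]);
```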

A workflow that has none of the three is honestly an aspirational entry — documented because the pattern is real and operators ask about it, but not yet validated. Those workflows render with the unverified pill and a note inviting contribution.

Why some workflows have zero benchmark coverage

Three legitimate reasons a workflow has zero benchmark rows attached:

  • The workflow is too new. A pattern that’s been viable for six weeks doesn’t have a backlog of submissions yet. The workflow exists in the catalog because the pattern is documented, the hardware envelope is reasonable, and editorial expects coverage to fill in. We’d rather have a documented unverified workflow than a missing one.
  • The bottleneck isn’t throughput. Some workflows are dominated by retrieval latency, by tool-call round-trips, by an external API rate limit, or by a chain of model calls where individual decode rate is not the operator question. Decode tok/s benchmarks would be technically measurable but not informative; the catalog correctly elides them and surfaces qualitative notes instead.
  • The workflow is hardware-class-broad. A workflow that’s sensible on anything from an M2 Pro to a dual-3090 to an H100 doesn’t have a meaningful single envelope; the per-stack pages carry the per-tier numbers and the workflow page links to those. Trying to cram the entire envelope into one row would obscure rather than clarify.

The wrong reading of zero coverage is “this workflow is broken” or “the catalog is empty.” Neither is what an empty rail means. The right reading is “this is the documented shape; the per-component pages carry the numbers; if you run it and submit a benchmark, this rail fills in.”

How reproduction strengthens workflow trust

A reproduced benchmark on a workflow’s component shape is worth more than a single editorial measurement, because reproduction means the configuration was buildable from the documentation by somebody who didn’t write it. That’s the actual claim the workflow page is making to a reader: this pattern is runnable, the docs are sufficient, the hardware envelope is honest.

Three reproduction-driven signals lift a workflow; a sketch of how the first two might be computed follows the list:

  • Cross-hardware reproduction. The same workflow shape benchmarked on two distinct hardware tiers, each within the declared envelope, demonstrates the envelope is real rather than an editorial assumption. Cross-tier reproduction is the strongest single move.
  • Cross-runtime reproduction. The same workflow shape on llama.cpp and on Ollama, both producing reasonable numbers, demonstrates the workflow isn’t accidentally tied to one runtime’s quirks. Some workflows genuinely are runtime-specific; reproduction sorts those out.
  • Failure-mode reports. A reproducer hitting the same documented failure mode (out-of-VRAM at long context, tool-call timeout under load) is itself useful signal. We surface confirmed failure modes on the workflow page; reproduced failure modes carry more weight than single-source ones.
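
A sketch of how the cross-hardware and cross-runtime signals might be computed from submissions. The Submission shape and both thresholds are hypothetical; the point is only that each check is a set-cardinality test over submissions inside the declared envelope:

```ts
// Hypothetical submission record; illustrative only.
interface Submission {
  submitter: string;       // independent operator who ran the workflow
  tier: string;            // hardware tier, e.g. "apple-silicon"
  runtime: string;         // e.g. "llama.cpp" or "ollama"
  withinEnvelope: boolean; // configuration falls inside the declared envelope
}

function reproductionSignals(subs: Submission[]) {
  // Only submissions inside the declared envelope count toward either signal.
  const valid = subs.filter((s) => s.withinEnvelope);
  return {
    // Two distinct tiers demonstrate the envelope is real, not assumed.
    crossHardware: new Set(valid.map((s) => s.tier)).size >= 2,
    // Two distinct runtimes demonstrate the shape isn't runtime-specific.
    crossRuntime: new Set(valid.map((s) => s.runtime)).size >= 2,
  };
}
```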

Editorial-validated vs. operator-validated

The two terms aren’t rungs on a single hierarchy, and they aren’t interchangeable. Each captures a kind of trust the other can’t.

Editorial-validated workflows have been built and run by RunLocalAI’s editorial team. The strength is depth: editorial knows what broke, why, and what the workaround was, and publishes that as part of the workflow page. The limit is breadth — editorial owns a finite hardware fleet and can’t run every workflow on every tier.

Operator-validated workflows have been built and run by independent operators who submitted benchmarks. The strength is breadth: operators run on hardware editorial doesn’t have, on driver branches editorial hasn’t tested, in regional power and thermal conditions editorial can’t simulate. The limit is depth — we know it ran, we know roughly how fast; we don’t always know what subtle thing broke first or what the workaround was.

The strongest workflow trust signal is both pills present: editorial has run it, operators have reproduced it on adjacent hardware. That combination tells the reader the documentation works, the envelope is real, and edge cases are surfaced.

Known gaps in this methodology

Three honest acknowledgements about what this methodology cannot do.

  • Quality validation is partial. We can validate that a RAG workflow runs end-to-end and produces some output. We have very little ability to validate that the output is actually good for the operator’s document corpus — that’s a domain-specific evaluation that lives outside the catalog. The workflow page surfaces patterns, not eval scores.
  • Long-running stability is partial. A workflow that runs cleanly for a single benchmark might leak memory or drift over a 48-hour session. We don’t systematically measure this. Notes get added when operators report it; absence of notes is not absence of the problem.
  • Cost validation is bracketed. The economics-methodology assumptions discussed at /resources/local-vs-cloud-economics-methodology apply to the cost numbers a workflow surfaces. Pricing fluctuates regionally and over time, so the workflow page’s $/month figures are bracketed ranges, not point estimates; a sketch of that bracketing follows this list.
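
As an illustration of that bracketing, here is how a cost figure could be carried and rendered as a range. The field names and helper are hypothetical; only the linked assumptions page is real:

```ts
// Hypothetical bracketed cost figure; illustrative only.
interface CostBracket {
  lowUsdPerMonth: number;
  highUsdPerMonth: number;
  assumptionsUrl: string; // "/resources/local-vs-cloud-economics-methodology"
}

// Render as a range, never a point estimate.
function renderCost(b: CostBracket): string {
  return `$${b.lowUsdPerMonth}–$${b.highUsdPerMonth}/month (assumptions: ${b.assumptionsUrl})`;
}
```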

Adjacent reading

This page sits next to the verification policy for community submissions, the reproduction guide for the operator protocol, and the confidence methodology for how those signals collapse into a tier label. The versioned-benchmarking methodology documents the metadata fields a reproducer should record when submitting a workflow run.

Next recommended step

See workflow detail pages with editorial-validation badges and operator-reproduction strips.

Back to /resources. See also /editorial-policy for how editorial validations are produced.