Resources
Original diagrams and reference assets for local AI. Free to embed in articles, blog posts, GitHub READMEs, slide decks — attribution appreciated. Each diagram is hand-built SVG, dark-mode aware, accessible, and dependency-free.
License: CC-BY-4.0. Suggested citation: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0
Methodology
The trust layer behind every score grid, benchmark badge, and confidence tier in the catalog. Operator-language formulas, the four-state verification ladder for community submissions, the reproduction protocol that lifts rows up that ladder, and the honest limits of any rule-based system.
Operator-language formulas for compatibility, runtime maturity, setup complexity, maintenance burden, stability, beginner-friendliness, Linux + mobile fit, VRAM-per-dollar, and perf-per-watt. Concrete examples per dimension; a tier-label reading guide; what scoring can't capture.
Ten-step operator protocol for reproducing a published benchmark, the matching set that defines a clean reproduction, the stopwatch flow for honest tok/s and TTFT, and what to do when your numbers don’t match.
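For the curious, the stopwatch flow reduces to a few lines. A minimal sketch, assuming a runtime that streams decoded tokens; the helper name and the choice to exclude the first token from the decode rate are illustrative, not the protocol's exact definitions:

```python
import time

def measure(stream):
    """Return (ttft_seconds, decode_tok_per_s) for a token stream.

    `stream` is any iterable yielding tokens as they decode; a
    hypothetical stand-in for your runtime's streaming API.
    """
    t0 = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0  # time to first token
        count += 1
    elapsed = time.perf_counter() - t0
    decode_time = elapsed - (ttft or 0.0)    # steady-state decode only
    tok_per_s = (count - 1) / decode_time if count > 1 and decode_time > 0 else float("nan")
    return ttft, tok_per_s
```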
The four-state trust ladder for community benchmarks (queued / approved / reproduced / independently reproduced), the rejection criteria, the audit-log discipline, and how anonymity and credit work.
Four confidence tiers (low / moderate / high / very-high), six factors that move a row up or down, why we never publish percentages, automatic decay rules, and how to nudge a benchmark’s confidence upward as a contributor.
Interactive calculators
Operator-grade math, no email gate, no tracking. Pure client-side. Same formulas the engine pages use — just exposed for sharing, citing, and embedding in articles.
How much GPU memory does a model need at your context length and quant? Splits weights, KV cache, activations, and runtime overhead with milestones for 8/12/16/24/32/48/80 GB cards.
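For intuition, the split reduces to simple arithmetic. A minimal sketch, assuming an FP16 KV cache and a flat overhead allowance that lumps in activations; the calculator's exact accounting may differ:

```python
def estimate_vram_gb(params_b: float, bits_per_param: float,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     context_len: int, kv_bytes: int = 2,
                     overhead_gb: float = 1.5) -> dict:
    """Rough VRAM budget: weights + KV cache + a flat overhead allowance."""
    GIB = 1024 ** 3
    weights = params_b * 1e9 * bits_per_param / 8 / GIB
    # K and V caches: 2 tensors per layer, one entry per token per KV head.
    kv_cache = 2 * n_layers * context_len * n_kv_heads * head_dim * kv_bytes / GIB
    return {"weights_gb": round(weights, 1),
            "kv_cache_gb": round(kv_cache, 1),
            "total_gb": round(weights + kv_cache + overhead_gb, 1)}

# Example: a Llama-3-8B-class model at Q4_K_M (~4.83 bpp), 8k context,
# FP16 KV cache. Roughly 4.5 GB weights + 1.0 GB KV + 1.5 GB overhead.
print(estimate_vram_gb(8.0, 4.83, n_layers=32, n_kv_heads=8,
                       head_dim=128, context_len=8192))
```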
What does running a local LLM cost in electricity? Watts × hours × kWh price → $/month, with the honest ChatGPT Plus break-even comparison and system-overhead caveats.
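The arithmetic itself, as a sketch. The 300 W draw and $0.15/kWh are example inputs, and GPU wattage alone understates real wall draw (system overhead adds to it):

```python
def monthly_cost_usd(watts: float, hours_per_day: float,
                     usd_per_kwh: float, days: int = 30) -> float:
    """Watts x hours x kWh price -> dollars per month."""
    return watts / 1000 * hours_per_day * days * usd_per_kwh

# Example: a 300 W GPU, 4 h/day, at $0.15/kWh: about $5.40/month,
# well under a $20/month hosted subscription. CPU, fans, and PSU
# losses push the real wall number higher.
print(f"${monthly_cost_usd(300, 4, 0.15):.2f}/month")
```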
Quantization cheat sheet
The bits-per-parameter footprint of every quant format the will-it-run engine knows about, from FP16 down to Q2_K. Color-coded by quality tier so the trade-off is visible at a glance — production-safe ≥6 bpp, sweet spot 4–6 bpp, degraded <4 bpp. The dashed line marks the Q4_K_M production sweet spot for 24 GB cards.
BITS_PER_PARAM table the will-it-run engine uses — Q4_K_M is 4.83, not 4, because it preserves 6-bit weights on attention and FFN layers.
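A sketch of such a lookup in Python. Only the FP16 and Q4_K_M entries come from this page; the rest are approximate llama.cpp block-format figures, not necessarily the engine's exact table:

```python
BITS_PER_PARAM = {
    "FP16":   16.0,
    "Q8_0":    8.5,   # 34-byte blocks of 32 weights
    "Q6_K":    6.56,  # approximate
    "Q4_K_M":  4.83,  # not 4: attention/FFN layers keep 6-bit weights
    "Q4_0":    4.5,   # 18-byte blocks of 32 weights
    "Q2_K":    2.56,  # approximate
}

def weights_gb(params_billion: float, quant: str) -> float:
    """Weights-only footprint in GB; KV cache and overhead come on top."""
    return params_billion * BITS_PER_PARAM[quant] / 8

print(f"{weights_gb(70, 'Q4_K_M'):.1f} GB")  # a 70B model at Q4_K_M: ~42 GB
```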
Local AI hardware checklist
The eight stages between a parts list and a stable local-AI rig — VRAM, bandwidth, software class, PSU, airflow, PCIe lanes, NVMe, OS. Skip one and the bottleneck moves there. Use this as a pre-purchase gate before the spec sheet wins out over the workload.
Local AI stack architecture
The seven-layer mental model — hardware to workflow. Where each concern actually lives. Use this to explain to teammates why one runtime change doesn't cascade into hardware changes (and vice versa).
GPU memory flow under inference
Where your VRAM actually goes. Model weights are the headline number, but KV cache + activations + runtime overhead consume meaningful budget — especially at long context. Answers the most common 'why does my model OOM mid-task?' question.
Multi-GPU topology — single, NVLink, PCIe, multi-node
Total VRAM is not pooled VRAM. NVLink is not magic. PCIe-only multi-GPU is real but slower. Multi-node is bandwidth-bound. Settles the most common multi-GPU misconception in one panel.
Runtime ecosystem 2026
The runtime constellation around model weights. Each engine owns a distinct OS / hardware / workload sweet spot. Use this when picking between Ollama, vLLM, llama.cpp, MLX-LM, ExLlamaV2, SGLang, TensorRT-LLM.
Local RAG architecture
The retrieval-augmented generation pipeline end to end: documents → chunker → embedder → vector DB → reranker → LLM → response. The reranker is the most undervalued stage; this diagram puts it where it earns its keep.
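To make the stage order concrete, a dependency-free toy sketch. Every component is a deliberately naive stand-in (bag-of-words 'embeddings', term-overlap 'reranking'); the names are hypothetical, not any particular library's API:

```python
import math

def chunk(doc: str, size: int = 80) -> list[str]:
    """Chunker: fixed-size character windows (real stacks split smarter)."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(text: str) -> dict[str, float]:
    """Embedder stand-in: bag-of-words counts instead of a neural model."""
    vec: dict[str, float] = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, documents: list[str], k: int = 20) -> list[str]:
    """Vector DB stand-in: brute-force top-k over embedded chunks."""
    chunks = [c for d in documents for c in chunk(d)]
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Reranker: a real stack rescores (query, chunk) pairs with a
    cross-encoder; term overlap here just marks where the stage sits."""
    terms = query.lower().split()
    return sorted(candidates, key=lambda c: sum(t in c.lower() for t in terms),
                  reverse=True)

def rag_prompt(query: str, documents: list[str]) -> str:
    context = "\n---\n".join(rerank(query, retrieve(query, documents))[:3])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```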
Embedding these diagrams
How to use these in your own writing without fuss.
Every diagram on this page is pure inline SVG with semantic labels. Three ways to embed:
- Screenshot the diagram from this page — the easiest path. Include the citation line.
- Copy the SVG element from your browser’s dev tools and save it as a .svg file. Use as-is in articles or slides; the SVG is text-only, no embedded fonts, dark-mode-friendly via CSS class fills.
- Reference the source by linking to runlocalai.co/resources and naming the diagram by its anchor.
Suggested citation: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0
Found a diagram useful in something you published? We’d love to see it — drop us a note at support@runlocalai.co.
More to come
Roadmap for the visual layer.
- Shipped — Quantization-format cheat sheet (FP16 → Q2_K)
- Shipped — Hardware-buying checklist (8 stages)
- Hardware-tier decision tree ($0 → $4000+)
- Local-AI privacy checklist
- Runtime × OS compatibility matrix (already at /compatibility)
- Coding-agent architecture (OpenHands + vLLM + RAG + sandbox)
- Mobile / on-device path (NPU → runtime → small model)
All assets here are CC-BY-4.0 — same terms as the other diagrams. Suggested citation: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0