RUNLOCALAI

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP · Fredoline Eruo


Stack Builder

Eight inputs — use case, budget, scale, privacy posture — and we compose the full rig: GPU + runtime + 1-3 model picks + first-run workflow + cost rollup + ready-to-paste install script. Three tiers side-by-side so the upgrade path stays visible.

Every recommendation references rule-based scoring; measured tok/s carries a confidence chip when surfaced. We don't invent numbers — when the data isn't there we say so.

Tell us about your build

URL updates as you change fields — share or bookmark a result. Showing the balanced default — change any field to refine.

Side-by-side: budget vs balanced vs stretch

one step down · your inputs · one step up

Budget — one step down on budget. What you give up; what you keep.
  • GPU: NVIDIA GeForce RTX 5090 Mobile · 24 GB
  • Runtime: Ollama
  • Top model: Qwen 3 14B · Q4_K_M
  • 3-yr TCO: —

Balanced — ~$899. Your inputs, our recommendation. Read the full card below.
  • GPU: NVIDIA GeForce RTX 3090 · 24 GB
  • Runtime: Ollama
  • Top model: Qwen 3 14B · Q4_K_M
  • 3-yr TCO: $1,046

Stretch — ~$2,500. One step up on budget. What you'd gain; what it costs.
  • GPU: NVIDIA L4 · 24 GB
  • Runtime: Ollama
  • Top model: Qwen 3 14B · Q4_K_M
  • 3-yr TCO: $2,530

Your recommended stack

full breakdown — read top to bottom
Balanced — recommended
NVIDIA · 24 GB VRAM · ~$899

NVIDIA GeForce RTX 3090 + Ollama + Qwen 3 14B

§ Hardware
Fits Qwen 2.5 Coder 32B Q4 with 32K context.
Expected throughput: 30-60 tok/s on 32B Q4 single-stream; 80-130 tok/s on 13B Q4.
Estimated (rule-based scoring) · Full hardware page →
§ Runtime
Ollama

Default pick for most operators: one binary, automatic GPU detection, OpenAI-compatible HTTP API at `:11434`. Sufficient for solo + small-team workloads.

  • Install: curl -fsSL https://ollama.com/install.sh | sh
  • Pull a model: ollama pull <model>:<tag>
  • HTTP API: http://localhost:11434
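The API can be smoke-tested without any client library. A minimal sketch, assuming the model tag `qwen3-14b` has already been pulled and the server is on its default port; the payload targets Ollama's OpenAI-compatible chat endpoint:

```shell
# Build a chat request for Ollama's OpenAI-compatible endpoint.
# "qwen3-14b" is the tag used elsewhere on this page - pull it first.
payload='{"model":"qwen3-14b","messages":[{"role":"user","content":"ping"}]}'
echo "$payload"

# With the server running, send it like so:
# curl -s http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```

The same server also answers Ollama's native endpoints (e.g. `/api/tags`, used in the install script below), so agents can speak either protocol.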
§ Model picks (2)
  • Qwen 3 14B
    14B params
    Q4_K_M
    ~8.6 GB

    Strongest general-purpose model at 14B in 2026. Multilingual tokenizer (1.7× more efficient on Turkish/Asian languages than Llama). Reasoning mode available.

    Community-reported · 30-45 tok/s on 16 GB VRAM
  • Phi-4 14B
    14B params
    Q4_K_M
    ~8.5 GB

    Microsoft's reasoning-focused 14B trained on heavy synthetic data. Beats Llama 3.1 8B on math/code benchmarks. Weaker creative writing.

    Editorial · 30-45 tok/s on 16 GB VRAM
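The ~8.5-8.6 GB file sizes above follow from simple bits-per-weight arithmetic. A rough sketch, assuming Q4_K_M averages about 4.85 bits per weight (an approximation; exact size varies by tensor layout):

```shell
# Estimate quantized file size: params × bits-per-weight ÷ 8.
# 4.85 bpw for Q4_K_M is an approximation, not a measured value.
size=$(awk 'BEGIN { printf "%.1f", 14e9 * 4.85 / 8 / 1e9 }')
echo "Qwen 3 14B @ Q4_K_M ~ ${size} GB"
```

Leave a few GB of headroom on top of the file size for KV cache and runtime overhead before calling a model a fit.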
§ First-run workflow
✓ Curated stack match
Your inputs match the editorial stack Build a local coding-agent stack (May 2026). That page has the field-tested version of this recipe with concrete commands and a why-not-the-alternative for each pick.
  1. Install Ollama on Linux.
  2. Pull the primary model: ollama pull qwen3-14b
  3. Verify it runs: ollama run qwen3-14b — type a test prompt.
  4. Connect a coding agent: install Cline (VS Code extension) or Aider (CLI) and point it at http://localhost:11434 (Ollama) or :8000 (vLLM).
  5. Verify the full loop end-to-end before adding observability, monitoring, or any second model.
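For step 4, one way the Aider hookup often looks as a config fragment; `OLLAMA_API_BASE` and the `ollama_chat/` model prefix are how Aider's documentation describes Ollama support at the time of writing, so verify against the docs for your version:

```shell
# Point Aider at the local Ollama server (config fragment - not run here).
# Assumption: your Aider version reads OLLAMA_API_BASE; check its docs.
export OLLAMA_API_BASE=http://localhost:11434
# aider --model ollama_chat/qwen3-14b
```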
§ Total cost of ownership (3-year)
  • Upfront: $899 (hardware)
  • Monthly electricity: $4 (at 0.84 kWh/day)
  • 3-year total: $1,046 (upfront + electricity)
  • Cloud equivalent: $0 (same token volume)
  • Break-even: — (local beats cloud)
  • Tok/s assumed: 0.0 (unknown)
Assumptions: 4 hr/day active, 60% utilization, 3-year amortization, $0.30/M token cloud equivalent. Tune in cost calculator →
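The card's totals reproduce with one line of arithmetic. A sketch, assuming an electricity rate of about $0.16/kWh; the rate itself isn't shown on the card, and $0.16 is simply the value that recovers the $4/month figure:

```shell
# 3-year TCO = upfront + 36 × monthly electricity.
# 0.84 kWh/day and $899 come from the card; $0.16/kWh is assumed.
summary=$(awk 'BEGIN {
  monthly = 0.84 * 30.4 * 0.16        # kWh/day × days/month × $/kWh
  total   = 899 + monthly * 36        # upfront + 36 months of power
  printf "monthly ~$%.0f, 3-year ~$%.0f", monthly, total
}')
echo "$summary"
```

Swap in your own electricity rate to see how sensitive the 3-year total is; at these wattages it moves the result by only tens of dollars.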

Install script

copy and paste — this gets you to first token
#!/usr/bin/env bash
# RunLocalAI stack installer — generated by /stack-builder
# Use case: coding
# Hardware: NVIDIA GeForce RTX 3090
# Runtime: Ollama

set -e

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull the primary model
ollama pull qwen3-14b

# 3. Sanity check
ollama run qwen3-14b "Hello — respond in one short sentence to confirm you're running."

# 4. Verify the HTTP API
curl http://localhost:11434/api/tags
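Before running the installer, a quick pre-flight confirms the driver actually sees the card. A sketch, assuming an NVIDIA host where `nvidia-smi` ships with the driver; on machines without it, the script reports that and moves on:

```shell
#!/usr/bin/env bash
# Pre-flight: is the GPU visible to the driver?
if command -v nvidia-smi >/dev/null 2>&1; then
  # One CSV line per GPU: name, total VRAM in MiB
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found - install the NVIDIA driver first"
fi
```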

Where to go from here

GPU chooser →

Just the hardware-pick question, with side-by-side compare, price/perf scatter, and score breakdown per dimension.

Custom build engine →

Reverse direction: I have this hardware — what fits? Use this to validate the recommendation against your actual rig.

Quant Advisor →

Drill into the model picks: Q4_K_M vs Q5_K_M vs Q8 on your specific VRAM, with quality curve + VRAM fit visualization.

TCO calculator →

Tune every assumption: utilization, electricity rate, cloud equivalent rate, amortization horizon.

Curated stacks →

18 hand-curated stack recipes for specific outcomes (coding agent, offline RAG, dual-3090, Mac cluster, iPhone, etc.)