Homelab · Weekend build-out

Homelab AI API gateway

Self-hosted OpenAI-compatible API for everything you build. vLLM + LiteLLM + Caddy + per-app API keys + Prometheus metrics. The drop-in replacement for the OpenAI Python client across your homelab projects.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,700 words

Build summary

  • Hardware footprint: RTX 4090 or dual RTX 3090 · 64 GB RAM · 1 TB NVMe
  • Concurrency: 5-20 concurrent personal projects (light bursts per project)
  • Power: ~300-400 W under typical homelab load

Goal: One self-hosted endpoint that every personal project (Discord bot, Obsidian plugin, IDE, scripts) can hit with no cloud cost.
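
What this looks like from a project's side: a minimal sketch using the official openai Python package, assuming a hypothetical gateway hostname (ai.lan.example) and a per-app virtual key in the LITELLM_KEY environment variable.

```python
import os

from openai import OpenAI

# Point the stock OpenAI client at the gateway instead of api.openai.com.
# Hostname and env var are placeholders for your own setup.
client = OpenAI(
    base_url="https://ai.lan.example/v1",
    api_key=os.environ["LITELLM_KEY"],  # per-app virtual key issued by LiteLLM
)

resp = client.chat.completions.create(
    model="qwen-2.5-14b-instruct",  # logical name; LiteLLM routes it to vLLM
    messages=[{"role": "user", "content": "Summarize today's RSS digest in 3 bullets."}],
)
print(resp.choices[0].message.content)
```

Point base_url back at api.openai.com and the same code runs against the cloud, which is exactly what makes the gateway a drop-in replacement.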

Operator card

Best for
  • ✓ Homelab operators with 3+ side projects calling LLMs
  • ✓ Anyone tired of OpenAI/Anthropic monthly invoices
  • ✓ Privacy-conscious devs who don't want their prompts in cloud logs
  • ✓ Engineers building or testing OpenAI-compatible clients
Avoid if
  • ⚠ You only have one project (just run Ollama directly)
  • ⚠ You need >50 concurrent users (production tier)
  • ⚠ You're not comfortable maintaining 3-4 Docker containers
  • Stability: stable
  • Maintenance: monthly check
  • Skill: intermediate
  • Long-session reliability: reliable

Service ledger

7 services across 2 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
vLLM
Inference
8000/tcp
Primary inference. Continuous batching makes this the right backend when N small clients are firing 1-shot requests on overlapping schedules.
Runs: Docker, GPU 0
Qwen 2.5 14B Instruct
Model
General-purpose workhorse. Strong default for chat, coding, classification, and summarization. Quantized (AWQ/GPTQ), it fits a 24 GB homelab card comfortably.
Runs: vLLM
Qwen 2.5 Coder 7B
Model
Coding specialist. Routed via LiteLLM when a client requests model=qwen-coder; served from its own vLLM container on a second port (see the client sketch after this ledger).
Runs: second vLLM container
nomic-embed-text
Embeddings
Shared embeddings. Lets every personal project use the same embeddings without hitting OpenAI; routed via LiteLLM as model=nomic-embed-text.
Runs: Ollama
Operations
Caddy
Proxy / TLS
TLS + rate limit. Auto-TLS via Let's Encrypt; rate limiting via the caddy-ratelimit plugin; the Caddyfile is short enough to read at a glance.
Runs: host systemd
LiteLLM virtual keys
Auth
API key management. Per-app virtual keys with budgets, rate limits, model whitelisting. Acts as the IAM layer in front of the inference engine.
Runs: embedded in LiteLLM
Prometheus + Grafana
Observability
Metrics + dashboards. vLLM and LiteLLM both export Prometheus natively; Grafana dashboards for per-key spend, latency, queue depth.
Runs: Docker compose
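
To make the routing concrete: a sketch of two projects hitting different backends through the one endpoint, using the logical model names from this ledger (hostname and key are the same placeholder assumptions as the first sketch).

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://ai.lan.example/v1",
                api_key=os.environ["LITELLM_KEY"])

# Coding call: LiteLLM routes model=qwen-coder to the second vLLM container.
code = client.chat.completions.create(
    model="qwen-coder",
    messages=[{"role": "user", "content": "Write a bash one-liner to dedupe a file."}],
)

# Embeddings call: same endpoint, but LiteLLM hands this one to Ollama.
emb = client.embeddings.create(
    model="nomic-embed-text",
    input=["note one", "note two"],
)
print(code.choices[0].message.content)
print(len(emb.data[0].embedding))  # embedding dimensionality
```

The client never learns which engine served the call; that indirection is what lets you swap backends later without touching project code.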

Hardware

A single RTX 4090 is the sweet spot. Dual 3090s unlock 70B serving for the rare project that needs it.

The workload pattern matters more than peak throughput here. Most homelab API calls are short bursts (a Discord bot reply, an Obsidian summarization, a script classifying text). vLLM's continuous batching shines for exactly this — many short requests with tiny KV-cache footprints share the GPU well.
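
The shape of that workload, sketched with the async client: several unrelated "projects" firing short one-shot requests that vLLM batches onto one GPU (endpoint, key, and model name carry the same assumptions as the earlier sketches).

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://ai.lan.example/v1",
                     api_key=os.environ["LITELLM_KEY"])

async def one_shot(prompt: str) -> str:
    # Each call is a tiny independent request: the exact pattern
    # continuous batching is built for.
    resp = await client.chat.completions.create(
        model="qwen-2.5-14b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main():
    prompts = [
        "Classify this ticket: 'printer is offline again'",  # cron script
        "Reply briefly: 'is the bot up?'",                   # Discord bot
        "One-line summary of: standup notes ...",            # Obsidian plugin
    ]
    # Fired concurrently; vLLM interleaves them instead of queueing serially.
    for answer in await asyncio.gather(*(one_shot(p) for p in prompts)):
        print(answer)

asyncio.run(main())
```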

Aim for ~80% average GPU utilization with bursts to 100%. If utilization is pinned at 100% for sustained stretches, you've outgrown a single card.

Storage

~50 GB for model weights, ~5 GB for LiteLLM logs (rotate weekly), trivial for everything else.

If you log full request/response pairs (LiteLLM has a flag for this), prompt history grows fast — ~10 KB per call, so 10K calls/day ≈ 100 MB/day. Rotate aggressively or pipe to Loki with 30-day retention.

Networking

Bind LiteLLM to 127.0.0.1 and Caddy to 0.0.0.0:443. Caddy terminates TLS and forwards to LiteLLM; LiteLLM forwards to vLLM / Ollama on a private Docker network.

For LAN-only access: bind Caddy to LAN interface, no public DNS.

For remote access: Tailscale OR Cloudflare Tunnel + Cloudflare Access (gate by Google SSO email).

NEVER expose vLLM or Ollama directly to the internet — they have no authentication or rate limiting by default.
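
A quick way to sanity-check those exposure rules from a machine outside the LAN: a minimal probe sketch, assuming default ports (Caddy 443, LiteLLM 4000, vLLM 8000, Ollama 11434) and a hypothetical public hostname. Only 443 should answer.

```python
import socket

HOST = "ai.example.com"  # placeholder public hostname

def is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    # A plain TCP connect is enough to detect an exposed listener.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Only Caddy's 443 should be reachable; everything else must time out.
for port, svc in [(443, "caddy"), (4000, "litellm"), (8000, "vllm"), (11434, "ollama")]:
    state = "OPEN" if is_open(HOST, port) else "closed"
    print(f"{svc:<8} :{port:<6} {state}")
```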

Observability

Per-key metrics are the operational gold:

  • Calls per minute per virtual key. Catch runaway scripts before they exhaust the GPU.
  • Tokens-per-call distribution. A long tail of huge prompts usually means a bug somewhere.
  • Cost-equivalent (vs OpenAI). LiteLLM tracks "you saved $X this month" — fun and useful.
  • vLLM queue depth. Sustained > 5 means you've outgrown the single-card tier (a watch-script sketch follows this list).
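
The watch script mentioned above, as a sketch: it assumes Prometheus at localhost:9090 and vLLM's vllm:num_requests_waiting gauge (metric names shift between vLLM releases, so verify against your version's /metrics output).

```python
import requests

PROM = "http://localhost:9090"  # assumed Prometheus address

def queue_depth() -> float:
    # Instant query against Prometheus's HTTP API.
    r = requests.get(f"{PROM}/api/v1/query",
                     params={"query": "vllm:num_requests_waiting"},
                     timeout=5)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

depth = queue_depth()
print(f"vLLM queue depth: {depth:.0f}")
if depth > 5:
    print("sustained? you've outgrown the single-card tier")
```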

Security

Per-app keys. Issue a new virtual key per project. Set per-key budgets (1M tokens/month) so a runaway script can't hose the whole gateway.

Rate limits in LiteLLM. Cap requests/min per key.

Model whitelisting. Let the Discord bot call only qwen-2.5-14b-instruct; let your Aider workspace call qwen-coder. Lock high-cost models behind keys you actually trust.
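
Issuing such a key might look like the following sketch against LiteLLM's /key/generate endpoint, assuming the proxy on localhost:4000 and a master key in the environment. Parameter names follow LiteLLM's docs at the time of writing; double-check them against your installed version.

```python
import os

import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={
        "key_alias": "discord-bot",
        "models": ["qwen-2.5-14b-instruct"],  # whitelist: this key sees one model
        "max_budget": 5.0,                    # cost-equivalent budget, USD
        "budget_duration": "30d",             # budget resets monthly
        "rpm_limit": 30,                      # requests per minute
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["key"])  # hand this to the project; never share keys across apps
```

Note that LiteLLM's max_budget is cost-denominated; the token-based monthly cap described above maps less directly, with tpm_limit (tokens per minute) the closest knob.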

Caddy TLS. Don't disable it. Let's Encrypt costs nothing.

Audit log. LiteLLM writes a per-call log; pipe to Loki + retain 30 days. Catch abuse early.

Upgrade path

More models: add an entry to the LiteLLM model list and reload; every client sees the new model name with no code changes.

Higher concurrency: scale vertically (a better GPU) before you scale horizontally — running Ray Serve in a homelab adds operational pain that's rarely worth it solo.

Cloud hybrid: LiteLLM can route specific models to OpenAI / Anthropic / Together when needed. The client code stays the same.

Long-running jobs: add a Celery / RQ queue for batch jobs (transcription, summarization at scale). Don't pollute the synchronous API path.
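
A minimal RQ sketch of that split, assuming a local Redis and a hypothetical transcribe_and_summarize task defined in tasks.py: the API-facing code only enqueues, and a separate worker process burns the GPU time.

```python
from redis import Redis
from rq import Queue

from tasks import transcribe_and_summarize  # hypothetical batch job

# Jobs land in Redis; run `rq worker batch` in another process to drain them.
q = Queue("batch", connection=Redis())
job = q.enqueue(transcribe_and_summarize,
                "/data/podcasts/ep42.mp3",
                job_timeout="30m")
print(job.id)  # poll job.result later; the synchronous API path stays fast
```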

What breaks first

  1. Quota miscalibration. A new project's first month always blows its budget. Set a tiny initial quota per key and raise it after observing actual usage.
  2. vLLM image bumps can break OpenAI-API compatibility. Pin the image digest and read changelogs before bumping.
  3. LiteLLM database growth. Per-call logging grows fast. Rotate, or move to Postgres for retention.
  4. Caddy auto-renew fails if your DNS provider rate-limits. Use the DNS-01 challenge with a provider you control (e.g. Cloudflare).
  5. GPU fan failure from dust buildup. Quarterly cleaning isn't optional on a 24/7 homelab.

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

  • /stacks/rtx-4090-workstation →
  • /stacks/dual-3090-workstation →

Workflow validation

Status: unvalidated

Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 2 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.

  • Unvalidated: qwen-2.5-14b-instruct via vLLM
    No public benchmarks yet. The workflow's claim about this model is currently unsubstantiated by measurements.
    0 benchmarks · Submit the first benchmark →
  • Unvalidated: qwen-2.5-coder-7b via vLLM
    No public benchmarks yet. The workflow's claim about this model is currently unsubstantiated by measurements.
    0 benchmarks · Submit the first benchmark →
Validate this workflow → · See benchmark roadmap → · How validation works →
Report outdatedSuggest a correctionDid this workflow work for you?