Homelab AI API gateway
Self-hosted OpenAI-compatible API for everything you build. vLLM + LiteLLM + Caddy + per-app API keys + Prometheus metrics. The drop-in replacement for the OpenAI Python client across your homelab projects.
Build summary
Goal: One self-hosted endpoint that every personal project (Discord bot, Obsidian plugin, IDE, scripts) can hit with no cloud cost.
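A minimal client sketch of that drop-in idea, assuming Caddy serves the gateway at https://ai.example.lan and LiteLLM issued the virtual key; the hostname, key, and model name are placeholders for whatever your config actually defines.

```python
# Minimal sketch: the official openai client pointed at the homelab gateway.
# The base_url, virtual key, and model name are placeholders; substitute
# whatever your Caddy hostname and LiteLLM model_list actually define.
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.example.lan/v1",  # Caddy -> LiteLLM -> vLLM/Ollama
    api_key="sk-discord-bot-xxxx",         # per-app virtual key issued by LiteLLM
)

resp = client.chat.completions.create(
    model="qwen-2.5-14b-instruct",
    messages=[{"role": "user", "content": "Summarize today's standup notes."}],
)
print(resp.choices[0].message.content)
```

The point is that nothing project-specific changes except the base URL and the key: the same code runs against OpenAI's API or against the gateway.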
Operator card
- ✓ Homelab operators with 3+ side projects calling LLMs
- ✓ Anyone tired of OpenAI/Anthropic monthly invoices
- ✓ Privacy-conscious devs who don't want their prompts in cloud logs
- ✓ Engineers building or testing OpenAI-compatible clients
- ⚠ You only have one project (just run Ollama directly)
- ⚠ You need >50 concurrent users (that's production tier)
- ⚠ You're not comfortable maintaining 3-4 Docker containers
Service ledger
7 services across 2 layers. Each entry includes a one-line operator note on why that pick beats the alternatives.
Hardware
A single RTX 4090 is the sweet spot. Dual 3090s unlock 70B serving for the rare project that needs it.
The workload pattern matters more than peak throughput here. Most homelab API calls are short bursts (Discord bot reply, Obsidian summarization, script classification). vLLM's continuous batching is built for exactly this: many short requests with tiny KV-cache footprints share the GPU well.
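A rough sketch of that burst shape, using the async OpenAI client to fire a pile of short requests at once; the endpoint, key, and model name are placeholders for your own config.

```python
# Sketch of the typical homelab burst: many short, independent completions at once,
# which vLLM's continuous batching interleaves on the GPU instead of serving serially.
# Endpoint, key, and model name are placeholders for your own config.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://ai.example.lan/v1", api_key="sk-scripts-xxxx")

async def classify(text: str) -> str:
    resp = await client.chat.completions.create(
        model="qwen-2.5-14b-instruct",
        messages=[{"role": "user", "content": f"Label the sentiment of: {text}"}],
        max_tokens=8,
    )
    return resp.choices[0].message.content

async def main() -> None:
    texts = ["great release", "broken again", "meh"] * 10  # 30 short requests
    results = await asyncio.gather(*(classify(t) for t in texts))
    print(results[:3])

asyncio.run(main())
```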
Aim for roughly 80% average GPU utilization, hitting 100% only during bursts. Sustained 100% means you've outgrown a single card.
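One way to sanity-check that target is to sample nvidia-smi while your projects run; a minimal sketch, assuming nvidia-smi is on PATH and a single GPU at index 0.

```python
# Minimal utilization sampler: poll nvidia-smi once per second and report the
# average and peak, to compare against the ~80% average / 100% burst target.
# Assumes nvidia-smi is on PATH and a single GPU (index 0).
import subprocess
import time

samples = []
for _ in range(300):  # ~5 minutes at one sample per second
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    samples.append(int(out.splitlines()[0].strip()))
    time.sleep(1)

print(f"avg {sum(samples) / len(samples):.0f}%  peak {max(samples)}%")
```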
Storage
~50 GB for model weights, ~5 GB for LiteLLM logs (rotate weekly), trivial for everything else.
If you log full request/response pairs (LiteLLM has a flag), prompt history grows fast — ~10 KB per call. 10K calls/day = 100 MB/day. Rotate aggressively or pipe to Loki with a 30-day retention.
Networking
Bind LiteLLM to 127.0.0.1, Caddy to 0.0.0.0:443. Caddy terminates TLS, forwards to LiteLLM. LiteLLM forwards to vLLM / Ollama on private Docker network.
For LAN-only access: bind Caddy to LAN interface, no public DNS.
For remote access: Tailscale OR Cloudflare Tunnel + Cloudflare Access (gate by Google SSO email).
NEVER expose vLLM or Ollama directly to the internet — they have no auth and no rate-limiting.
Observability
Per-key metrics are the operational gold (a Prometheus query sketch follows this list):
- Calls per minute per virtual key. Catch runaway scripts before they exhaust GPU.
- Tokens per call distribution. A long tail of huge prompts usually means a bug somewhere.
- Cost-equivalent (vs OpenAI). LiteLLM tracks "you saved $X this month" — fun and useful.
- vLLM queue depth. Sustained > 5 means you've outgrown the single-card tier.
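A sketch of pulling the first of these out of Prometheus' HTTP API; the metric and label names are illustrative only, so check the /metrics output of the LiteLLM version you run for the exact names it exports.

```python
# Sketch: ask Prometheus for per-key request rates over the last 5 minutes.
# The metric name "litellm_requests_metric" and the "api_key" label are
# illustrative; check LiteLLM's /metrics output for what your version exports.
import requests

PROM = "http://prometheus.lan:9090"  # placeholder Prometheus address
query = 'sum by (api_key) (rate(litellm_requests_metric[5m]))'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    key = series["metric"].get("api_key", "unknown")
    print(f'{key}: {float(series["value"][1]):.2f} req/s')
```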
Security
Per-app keys. Issue a new virtual key per project. Set per-key budgets (1M tokens/month) so a runaway script can't hose the whole gateway.
Rate limits in LiteLLM. Cap requests/min per key.
Model whitelisting. Let the Discord bot only call qwen-7b; let your Aider workspace call qwen-coder-32b. Lock down high-cost models behind keys you actually trust.
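A sketch of minting such a scoped key through LiteLLM's key-management endpoint; the endpoint and field names reflect recent proxy versions and every value is a placeholder, so verify against the docs for the version you actually run.

```python
# Sketch: mint a scoped virtual key through LiteLLM's key-management API.
# Endpoint and field names reflect recent LiteLLM proxy versions; verify them
# against the docs for the version you run. All values are placeholders.
import requests

LITELLM = "http://127.0.0.1:4000"   # LiteLLM bound to localhost
MASTER_KEY = "sk-master-xxxx"       # never hand this to an app

resp = requests.post(
    f"{LITELLM}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "key_alias": "discord-bot",
        "models": ["qwen-7b"],      # whitelist: this key can only call qwen-7b
        "max_budget": 5.0,          # cost-equivalent budget, in dollars
        "budget_duration": "30d",
        "rpm_limit": 60,            # requests per minute
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["key"])           # hand this virtual key to the app
```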
Caddy TLS. Don't disable it. Let's Encrypt costs nothing.
Audit log. LiteLLM writes a per-call log; pipe to Loki + retain 30 days. Catch abuse early.
Upgrade path
More models: add an entry to the LiteLLM config pointing at the new vLLM or Ollama backend; clients only need the new model name, no code changes.
Higher concurrency: scale vertically (a better GPU) before you scale horizontally; running Ray Serve in a homelab adds operational pain that's rarely worth it solo.
Cloud hybrid: LiteLLM can route specific models to OpenAI / Anthropic / Together when needed. The client code stays the same.
Long-running jobs: add a Celery / RQ queue for batch jobs (transcription, summarization at scale). Don't pollute the synchronous API path.
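A minimal RQ sketch of that split, assuming a local Redis and a separate `rq worker batch` process; the queue name, file path, and gateway details are placeholders.

```python
# Sketch: push long summarization jobs onto an RQ queue instead of holding an
# HTTP request open against the gateway. Assumes a local Redis and a separate
# `rq worker batch` process; queue name, path, and gateway details are placeholders.
from redis import Redis
from rq import Queue
from openai import OpenAI

def summarize_file(path: str) -> str:
    """Runs inside the RQ worker; keep it in an importable module (e.g. jobs.py)."""
    client = OpenAI(base_url="https://ai.example.lan/v1", api_key="sk-batch-xxxx")
    text = open(path, encoding="utf-8").read()
    resp = client.chat.completions.create(
        model="qwen-2.5-14b-instruct",
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    return resp.choices[0].message.content

queue = Queue("batch", connection=Redis())
job = queue.enqueue(summarize_file, "/data/transcripts/example.txt", job_timeout=600)
print(job.id)
```

The worker pulls jobs off Redis at its own pace, so batch traffic never competes with interactive requests on the synchronous API path.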
What breaks first
- Quota miscalibration. A new project's first month always blows the budget. Set a tiny initial quota per key, raise it after observing actual usage.
- vLLM image bumps can break OpenAI-API compatibility. Pin the image SHA and read changelogs before bumping.
- LiteLLM database growth. Per-call logging grows fast. Rotate or move to Postgres for retention.
- Caddy certificate auto-renewal can fail if your DNS provider rate-limits. Use the DNS-01 challenge with a provider you control (e.g., Cloudflare).
- GPU fan failure on dust buildup. Quarterly cleaning isn't optional on a 24/7 homelab.
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 2 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-14b-instruct via vLLM. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.
- Unvalidated: qwen-2.5-coder-3b via vLLM. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.