Homelab AI API gateway
Self-hosted OpenAI-compatible API for everything you build. vLLM + LiteLLM + Caddy + per-app API keys + Prometheus metrics. The drop-in replacement for the OpenAI Python client across your homelab projects.
Build summary
Goal: One self-hosted endpoint that every personal project (Discord bot, Obsidian plugin, IDE, scripts) can hit with no cloud cost.
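A minimal client sketch of that drop-in idea, assuming Caddy serves the gateway at https://ai.example.lan and LiteLLM issued the virtual key; the hostname, key, and model name are placeholders for whatever your config actually defines.

```python
# Minimal sketch: the official openai client pointed at the homelab gateway.
# The base_url, virtual key, and model name are placeholders; substitute
# whatever your Caddy hostname and LiteLLM model_list actually define.
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.example.lan/v1",  # Caddy -> LiteLLM -> vLLM/Ollama
    api_key="sk-discord-bot-xxxx",         # per-app virtual key issued by LiteLLM
)

resp = client.chat.completions.create(
    model="qwen-2.5-14b-instruct",
    messages=[{"role": "user", "content": "Summarize today's standup notes."}],
)
print(resp.choices[0].message.content)
```

The point is that nothing project-specific changes except the base URL and the key: the same code runs against OpenAI's API or against the gateway.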
Operator card
- ✓ Homelab operators with 3+ side projects calling LLMs
- ✓ Anyone tired of OpenAI/Anthropic monthly invoices
- ✓ Privacy-conscious devs who don't want their prompts in cloud logs
- ✓ Engineers building or testing OpenAI-compatible clients
- ⚠ You only have one project (just run Ollama directly)
- ⚠ You need >50 concurrent users (that's production tier)
- ⚠ You're not comfortable maintaining 3-4 Docker containers
Service ledger
7 services across 2 layers. Each entry includes a one-line operator note on why that pick beats the alternatives.
Hardware
A single RTX 4090 is the sweet spot. Dual 3090s unlock 70B serving for the rare project that needs it.
The workload pattern matters more than peak throughput here. Most homelab API calls are short bursts (Discord bot reply, Obsidian summarization, script classification). vLLM's continuous batching is built for exactly this: many short requests with tiny KV-cache footprints share the GPU well.
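A rough sketch of that burst shape, using the async OpenAI client to fire a pile of short requests at once; the endpoint, key, and model name are placeholders for your own config.

```python
# Sketch of the typical homelab burst: many short, independent completions at once,
# which vLLM's continuous batching interleaves on the GPU instead of serving serially.
# Endpoint, key, and model name are placeholders for your own config.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://ai.example.lan/v1", api_key="sk-scripts-xxxx")

async def classify(text: str) -> str:
    resp = await client.chat.completions.create(
        model="qwen-2.5-14b-instruct",
        messages=[{"role": "user", "content": f"Label the sentiment of: {text}"}],
        max_tokens=8,
    )
    return resp.choices[0].message.content

async def main() -> None:
    texts = ["great release", "broken again", "meh"] * 10  # 30 short requests
    results = await asyncio.gather(*(classify(t) for t in texts))
    print(results[:3])

asyncio.run(main())
```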
Aim for roughly 80% average GPU utilization, hitting 100% only during bursts. Sustained 100% means you've outgrown a single card.
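One way to sanity-check that target is to sample nvidia-smi while your projects run; a minimal sketch, assuming nvidia-smi is on PATH and a single GPU at index 0.

```python
# Minimal utilization sampler: poll nvidia-smi once per second and report the
# average and peak, to compare against the ~80% average / 100% burst target.
# Assumes nvidia-smi is on PATH and a single GPU (index 0).
import subprocess
import time

samples = []
for _ in range(300):  # ~5 minutes at one sample per second
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    samples.append(int(out.splitlines()[0].strip()))
    time.sleep(1)

print(f"avg {sum(samples) / len(samples):.0f}%  peak {max(samples)}%")
```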
Storage
~50 GB for model weights, ~5 GB for LiteLLM logs (rotate weekly), trivial for everything else.
If you log full request/response pairs (LiteLLM has a flag), prompt history grows fast — ~10 KB per call. 10K calls/day = 100 MB/day. Rotate aggressively or pipe to Loki with a 30-day retention.
Networking
Bind LiteLLM to 127.0.0.1, Caddy to 0.0.0.0:443. Caddy terminates TLS, forwards to LiteLLM. LiteLLM forwards to vLLM / Ollama on private Docker network.
For LAN-only access: bind Caddy to LAN interface, no public DNS.
For remote access: Tailscale OR Cloudflare Tunnel + Cloudflare Access (gate by Google SSO email).
NEVER expose vLLM or Ollama directly to the internet — they have no auth and no rate-limiting.
Observability
Per-key metrics are the operational gold (a Prometheus query sketch follows this list):
- Calls per minute per virtual key. Catch runaway scripts before they exhaust GPU.
- Tokens per call distribution. A long tail of huge prompts usually means a bug somewhere.
- Cost-equivalent (vs OpenAI). LiteLLM tracks "you saved $X this month" — fun and useful.
- vLLM queue depth. Sustained > 5 means you've outgrown the single-card tier.
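A sketch of pulling the first of these out of Prometheus' HTTP API; the metric and label names are illustrative only, so check the /metrics output of the LiteLLM version you run for the exact names it exports.

```python
# Sketch: ask Prometheus for per-key request rates over the last 5 minutes.
# The metric name "litellm_requests_metric" and the "api_key" label are
# illustrative; check LiteLLM's /metrics output for what your version exports.
import requests

PROM = "http://prometheus.lan:9090"  # placeholder Prometheus address
query = 'sum by (api_key) (rate(litellm_requests_metric[5m]))'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    key = series["metric"].get("api_key", "unknown")
    print(f'{key}: {float(series["value"][1]):.2f} req/s')
```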
Security
Per-app keys. Issue a new virtual key per project. Set per-key budgets (1M tokens/month) so a runaway script can't hose the whole gateway.
Rate limits in LiteLLM. Cap requests/min per key.
Model whitelisting. Let the Discord bot only call qwen-7b; let your Aider workspace call qwen-coder-32b. Lock down high-cost models behind keys you actually trust.
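A sketch of minting such a scoped key through LiteLLM's key-management endpoint; the endpoint and field names reflect recent proxy versions and every value is a placeholder, so verify against the docs for the version you actually run.

```python
# Sketch: mint a scoped virtual key through LiteLLM's key-management API.
# Endpoint and field names reflect recent LiteLLM proxy versions; verify them
# against the docs for the version you run. All values are placeholders.
import requests

LITELLM = "http://127.0.0.1:4000"   # LiteLLM bound to localhost
MASTER_KEY = "sk-master-xxxx"       # never hand this to an app

resp = requests.post(
    f"{LITELLM}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "key_alias": "discord-bot",
        "models": ["qwen-7b"],      # whitelist: this key can only call qwen-7b
        "max_budget": 5.0,          # cost-equivalent budget, in dollars
        "budget_duration": "30d",
        "rpm_limit": 60,            # requests per minute
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["key"])           # hand this virtual key to the app
```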
Caddy TLS. Don't disable it. Let's Encrypt costs nothing.
Audit log. LiteLLM writes a per-call log; pipe to Loki + retain 30 days. Catch abuse early.
Upgrade path
More models: add an entry to the LiteLLM config pointing at the new vLLM or Ollama backend; clients only need the new model name, no code changes.
Higher concurrency: scale vertically (a better GPU) before you scale horizontally; running Ray Serve in a homelab adds operational pain that's rarely worth it solo.
Cloud hybrid: LiteLLM can route specific models to OpenAI / Anthropic / Together when needed. The client code stays the same.
Long-running jobs: add a Celery / RQ queue for batch jobs (transcription, summarization at scale). Don't pollute the synchronous API path.
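A minimal RQ sketch of that split, assuming a local Redis and a separate `rq worker batch` process; the queue name, file path, and gateway details are placeholders.

```python
# Sketch: push long summarization jobs onto an RQ queue instead of holding an
# HTTP request open against the gateway. Assumes a local Redis and a separate
# `rq worker batch` process; queue name, path, and gateway details are placeholders.
from redis import Redis
from rq import Queue
from openai import OpenAI

def summarize_file(path: str) -> str:
    """Runs inside the RQ worker; keep it in an importable module (e.g. jobs.py)."""
    client = OpenAI(base_url="https://ai.example.lan/v1", api_key="sk-batch-xxxx")
    text = open(path, encoding="utf-8").read()
    resp = client.chat.completions.create(
        model="qwen-2.5-14b-instruct",
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    return resp.choices[0].message.content

queue = Queue("batch", connection=Redis())
job = queue.enqueue(summarize_file, "/data/transcripts/example.txt", job_timeout=600)
print(job.id)
```

The worker pulls jobs off Redis at its own pace, so batch traffic never competes with interactive requests on the synchronous API path.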
What breaks first
- Quota miscalibration. A new project's first month always blows the budget. Set a tiny initial quota per key, raise it after observing actual usage.
- vLLM image bumps can break OpenAI-API compatibility. Pin the image SHA and read changelogs before bumping.
- LiteLLM database growth. Per-call logging grows fast. Rotate or move to Postgres for retention.
- Caddy certificate auto-renewal can fail if your DNS provider rate-limits. Use the DNS-01 challenge with a provider you control (e.g., Cloudflare).
- GPU fan failure on dust buildup. Quarterly cleaning isn't optional on a 24/7 homelab.
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 2 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-14b-instruct via vLLM. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.
- Unvalidated: qwen-2.5-coder-3b via vLLM. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.