RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP·Fredoline Eruo
Production · Month build-out

Multi-user local AI server

Production-tier self-hosted AI for 20-100 users. SGLang or vLLM with replicas, LiteLLM gateway, Postgres-backed Open WebUI, SSO, observability, audit logging, backup. The internal-tools-team setup.

By Fredoline Eruo · Reviewed 2026-05-07 · ~2,000 words

Build summary

Hardware footprint
Dual H100 SXM OR quad RTX 6000 Ada · 256 GB RAM · 4 TB NVMe RAID · 10 GbE
Concurrency
20-100 concurrent users.
Power
Sustained 1500-2500 W server-class.

Goal: Deploy a private LLM API + chat UI for an organization without sending traffic to a cloud LLM vendor.

Operator card

Workflow
Best for
  • ✓Companies replacing OpenAI/Anthropic API spend with self-hosted
  • ✓Regulated industries that can't ship data to cloud LLMs
  • ✓Internal tools teams with 20-200 users
  • ✓Organizations with an existing K8s + observability stack
Avoid if
  • ⚠Headcount < 20 — overkill (see /workflows/homelab-ai-api)
  • ⚠You don't have a platform team
  • ⚠Your workload is bursty / unpredictable (cloud cheaper)
  • ⚠You can't commit to multi-year hardware lifecycle
Stability
battle tested
Maintenance
Daily attention
Skill
Expert
Long-session reliability
rock solid

Service ledger

8 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
SGLang
Inference
30000/tcp
Inference engine. RadixAttention prefix-cache compounds wins when many users share system prompts (which they do). Beats vLLM on production agentic workloads where prefix reuse is high.
Runs: Kubernetes deployment, GPU pool
LiteLLM
Gateway
4000/tcp
OpenAI-compatible gateway in front of the SGLang replicas. One place for per-user API keys, rate limits, and usage accounting.
Runs: Kubernetes deployment
Qwen 2.5 32B Instruct (FP8)
Model
Default model. FP8 on H100 fits comfortably with concurrent batching; 32K context is the multi-tenant default.
Runs: SGLang, dual H100
Surface
Open WebUI (Postgres backend)
Frontend
Chat surface. SQLite caps out around 10-20 active users; Postgres backing store handles 100+. SSO via OIDC.
Runs: Kubernetes deployment
Data
Qdrant cluster (3 nodes)
Vector DB
Shared vector DB. Multi-tenant per-user collections; horizontal scale; snapshots without downtime.
Runs: Kubernetes statefulset
MinIO + Velero
Storage
Object + backup storage. MinIO handles document attachments and backups; Velero snapshots K8s state for DR.
Runs: Kubernetes deployment
Operations
Authelia / Authentik
Auth
SSO. OIDC provider in front of Open WebUI + LiteLLM admin. Bridges to corporate AAD / Google Workspace via SAML.
Runs: Docker / K8s
Prometheus + Grafana + Loki + OpenTelemetry
Observability
Full o11y stack. Production needs metrics + logs + traces. Prometheus scrapes SGLang/LiteLLM; Loki ingests app logs; OTel collector unifies the trace surface.
Runs: Kubernetes deployment

Hardware

H100 SXM 80 GB is the sweet spot. Two cards via NVLink or NVLink-Switch fabric serve a 70B model with FP8 + concurrent batching for 50+ users.
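The two-card claim can be sanity-checked with a back-of-envelope memory model. The per-user KV-cache figure below is an illustrative assumption — it varies with context length, attention layout, and KV dtype — not a measured number:

```python
def vram_needed_gb(params_b, bytes_per_weight, kv_gb_per_user, users, overhead_gb=10):
    """Serving-memory envelope in GB: weights + live KV cache + runtime overhead."""
    weights_gb = params_b * bytes_per_weight   # FP8 = 1 byte per parameter
    kv_gb = kv_gb_per_user * users             # KV cache across active users
    return weights_gb + kv_gb + overhead_gb

# 70B in FP8 on 2x H100 80 GB, assuming ~1 GB of KV per active user (illustrative)
total_gb = vram_needed_gb(params_b=70, bytes_per_weight=1, kv_gb_per_user=1.0, users=50)
fits = total_gb <= 2 * 80   # 130 GB needed vs 160 GB available
```

With FP16 KV or full 32K contexts the per-user figure grows several-fold, which is why batching capacity, not weight size, is usually the binding constraint.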

RTX 6000 Ada (48 GB) is the lower-cost alternative — quad cards via tensor-parallel hit similar capacity at higher power draw.

Storage: NVMe RAID 1 minimum on the master node; Postgres and Qdrant are write-heavy and amplify writes further at the disk layer. 4 TB total gets you through ~2 years of conversation + document growth at typical org size.

Networking: 10 GbE between nodes is the floor. Inter-node tensor-parallel over 1 GbE is unusable.

Storage

Postgres for Open WebUI (50 GB / year / 100 users). Qdrant for shared embeddings (500 GB / 5M chunks). MinIO for raw documents + nightly backups (~2 TB rolling 30-day).
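Those per-service figures compose into a simple growth projection — a sketch using the estimates above as inputs, not measurements:

```python
def storage_gb(years, pg_gb_per_year=50, qdrant_gb=500, minio_rolling_gb=2000):
    """Total footprint in GB: Postgres grows yearly; Qdrant and the MinIO
    rolling 30-day backup window are treated as roughly steady-state."""
    return pg_gb_per_year * years + qdrant_gb + minio_rolling_gb

two_year_gb = storage_gb(2)        # 50*2 + 500 + 2000 = 2600 GB
fits_4tb = two_year_gb <= 4000     # consistent with the ~2-year fit on 4 TB
```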

Backup strategy is non-negotiable. Velero schedules + offsite replication + monthly restore drills.

Conversation history is regulated data in many industries. Encrypt at rest. Define a retention policy (90 days? 1 year?) and enforce it via cron.
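Enforcing retention reduces to computing a cutoff and purging older rows. The table and column names in the comment are hypothetical — check the actual Open WebUI schema before wiring this into a scheduled job:

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # policy choice; 1 year is the other common answer

def retention_cutoff(now=None, days=RETENTION_DAYS):
    """Everything last touched before this timestamp gets purged."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)

# The purge itself would be a parameterized DELETE run by the scheduler, e.g.
#   DELETE FROM chat WHERE updated_at < %s   -- table/column names hypothetical
cutoff = retention_cutoff(datetime(2026, 5, 7, tzinfo=timezone.utc))
```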

Networking

Internal: K8s ingress controller (nginx / Traefik), per-pod NetworkPolicies, mTLS between services via Linkerd or Istio.

External: corporate VPN OR Cloudflare Tunnel + Access. Public DNS, gated entry. Never expose SGLang / Qdrant directly.

DNS + LB: a single ai.internal.corp.com hostname; LB distributes across SGLang replicas.

Observability

Required dashboards:

  • Per-user usage (calls/day, tokens/day, latency p99). Catch runaway scripts.
  • Cluster health (GPU utilization across pods, KV-cache pressure, queue depth).
  • Cost-equivalent vs cloud (token volume × OpenAI rate).
  • Audit log volume. Compliance teams will ask.
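The cost-equivalent dashboard is just token volume times a cloud rate card. The rates and volumes below are illustrative placeholders, not any vendor's current pricing — substitute the rate card you would actually be paying:

```python
def monthly_cloud_equivalent_usd(input_mtok, output_mtok, in_rate=2.50, out_rate=10.00):
    """What this month's token volume would have cost at per-1M-token cloud rates."""
    return input_mtok * in_rate + output_mtok * out_rate

# e.g. 100 users generating ~2,000M input / 500M output tokens a month (illustrative)
equivalent = monthly_cloud_equivalent_usd(input_mtok=2000, output_mtok=500)
```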

Alerts:

  • GPU temp ≥ 84 °C → page ops
  • p99 latency > 5s → page ops
  • LiteLLM gateway down → page ops
  • Qdrant write errors > 1/min → page ops

OTel + Loki + Tempo gives you trace-level debugging when a specific user reports slowness.

Security

SSO + RBAC. Every user goes through Authelia → Open WebUI / LiteLLM admin. No shared accounts.

Per-user model whitelisting. Different teams get different model lists. Engineering may have access to coding models; HR doesn't.

Audit log retention. Legal will require this. 365 days minimum in most regulated industries.

Network segmentation. AI server in its own VLAN; no direct access to production databases.

Vulnerability scanning. Trivy on every image; Falco for runtime detection. Container escapes from agent code-execution sandboxes are a real threat.

Upgrade path

HA: a three-node K8s control plane (etcd quorum needs an odd member count — two masters is worse than one), multi-AZ if you have a real datacenter. Without HA, plan for ~99.9% uptime; with it, ~99.99%.
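Those two availability targets translate into very different downtime budgets, which is the real argument for HA:

```python
def downtime_per_year(availability):
    """Allowed downtime per year at a given availability, as (hours, minutes)."""
    minutes = (1 - availability) * 365 * 24 * 60
    return minutes / 60, minutes

three_nines_h, _ = downtime_per_year(0.999)     # ~8.8 hours/year
_, four_nines_min = downtime_per_year(0.9999)   # ~53 minutes/year
```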

Bigger models: 405B-class needs 4-8× H100 with NVLink-Switch — at that scale, evaluate cloud H100 rental honestly. Self-hosting frontier-class models is rarely cost-justified for orgs <500 users.

Fine-tuning: add Axolotl / Unsloth on a separate GPU pool. Production inference and fine-tuning don't share GPUs cleanly.

Multi-region: replicate Postgres + Qdrant cross-region; add Cloudflare for global routing. Crosses the line into "you have a real platform team now."

What breaks first

  1. GPU fan / thermal failures on continuous load. Scheduled hardware swaps every 18-24 months.
  2. K8s node-version drift. Kubelet upgrades break GPU passthrough until DCGM-operator catches up. Stage upgrades.
  3. SGLang RadixAttention assumptions. When prefix-cache hit rate drops (e.g. agentic prompts diverge), throughput collapses. Profile per workload type.
  4. Postgres bloat. Open WebUI writes a lot. Run VACUUM ANALYZE weekly; consider pgbouncer for connection pooling.
  5. Audit log explosion. Per-call logs at 100 users grow fast. Loki + S3 backend, retention-tiered to cold storage at 90 days.
  6. Compliance review surprises. GDPR / HIPAA / SOC2 consultations always find one missing log or one unencrypted volume. Build to the standard from day one.

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

  • /stacks/distributed-inference-homelab →
  • /stacks/h100-tensor-parallel-workstation →
Map this workflow to a build

Open the custom build engine and explore which hardware tier actually supports this workflow.

Open custom builder →

Workflow validation

Status: unvalidated

Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.

  • Unvalidated — qwen-2.5-32b-instruct via sglang
    No public benchmarks yet. The workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks · Submit the first benchmark →
Help keep this page accurate

We read every submission. Editorial review takes 1-7 days.

Report outdated · Suggest a correction · Did this workflow work for you?