Edge / air-gapped · Week build-out

Offline RAG pipeline

Air-gapped retrieval-augmented chat: document ingestion via unstructured.io, nomic-embed embeddings into Qdrant, bge-reranker reranking, Qwen 2.5 14B-Instruct generation, and Open WebUI as the chat surface, with observability and nightly snapshot backups.

By Fredoline Eruo · Reviewed 2026-05-07 · ~2,100 words

Build summary

Hardware footprint
RTX 4090, Apple M3 Max 64 GB, or dual RTX 3090 · 64 GB RAM · 2 TB NVMe
Concurrency
5-15 concurrent users on a single 4090; more requires replicas.
Power
Sustained 350-450 W on RTX; ~120 W on Apple M3 Max.

Goal: Ship a private Q&A system over a corpus of internal documents that never leaves the network.

Operator card

Workflow
Best for
  • ✓Compliance-heavy teams that can't ship documents to cloud LLMs
  • ✓Internal knowledge-base Q&A on private corpora
  • ✓Regulated industries (legal, healthcare, finance)
  • ✓Single-team RAG at 5-15 concurrent users
Avoid if
  • ⚠You need >50 concurrent users (move to multi-replica or cloud)
  • ⚠Your corpus is mostly low-quality scanned PDFs (OCR pre-step required)
  • ⚠You don't have an ops person who can run Docker + Prometheus
Stability
Battle-tested
Maintenance
Weekly attention
Skill
Advanced
Long-session reliability
Reliable

Service ledger

10 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
nomic-embed-text-v1.5
Embeddings
Embeddings model. Open-weights bi-encoder; MTEB-competitive; 137M params; runs comfortably on the same GPU alongside the LLM.
Runs: llama.cpp server
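A minimal smoke test for this service, assuming llama.cpp's llama-server was started with its embedding mode enabled and its OpenAI-compatible API bound to loopback port 8081 (port and GGUF filename are illustrative, and the exact flag name varies slightly across llama.cpp versions):

```python
import requests

# Query llama-server's OpenAI-compatible embeddings endpoint.
# Assumes something like: llama-server --embedding -m nomic-embed-text-v1.5.f16.gguf --port 8081
resp = requests.post(
    "http://127.0.0.1:8081/v1/embeddings",
    json={"input": ["search_query: test sentence"], "model": "nomic-embed-text-v1.5"},
    timeout=10,
)
resp.raise_for_status()
vec = resp.json()["data"][0]["embedding"]
print(len(vec))  # expect 768 dimensions for nomic-embed-text-v1.5
```

Note that nomic-embed-text-v1.5 expects task prefixes ("search_document: " for corpus chunks, "search_query: " for queries); omit them and retrieval quality quietly degrades.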
bge-reranker-v2-m3
Reranker
Cross-encoder reranker. Top-10 → top-5 rerank lifts grounded-generation quality dramatically; cheap enough to run on CPU.
Runs: FastAPI sidecar
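A sketch of that sidecar, assuming sentence-transformers' CrossEncoder hosts the model (it can load bge-reranker-v2-m3 as a sequence-classification cross-encoder); the endpoint name and request shape are illustrative:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import CrossEncoder

app = FastAPI()
model = CrossEncoder("BAAI/bge-reranker-v2-m3")  # CPU by default

class RerankRequest(BaseModel):
    query: str
    passages: list[str]
    top_k: int = 5

@app.post("/rerank")
def rerank(req: RerankRequest):
    # Score every (query, passage) pair, then keep the top_k passages.
    scores = model.predict([(req.query, p) for p in req.passages])
    ranked = sorted(zip(req.passages, scores), key=lambda x: -x[1])
    return [{"passage": p, "score": float(s)} for p, s in ranked[: req.top_k]]
```

Run it with uvicorn; when the CPU rerank queue backs up (see "What breaks first"), the same code moves to GPU by installing a CUDA build of torch.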
vLLM
Inference
8000/tcp
Inference engine. Concurrent batching scales to 5-15 users on a single 4090. Ollama works for solo but caps out at one stream.
Runs: Docker container, GPU 0
Qwen 2.5 14B-Instruct (AWQ)
Model
Generator LLM. Long-context (128K), strong instruction-following, fits 24 GB with 32K window comfortably. Llama 3.1 8B is the alternative when latency matters more than reasoning depth.
Runs: vLLM
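A sketch of a client call against vLLM's OpenAI-compatible endpoint on 8000/tcp; the HF model id and prompt scaffold are illustrative, and no real API key is needed on a loopback deployment:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; "EMPTY" is the conventional local placeholder.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",  # must match the model vLLM was launched with
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "CONTEXT:\n...\n\nQUESTION: ..."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(resp.choices[0].message.content)
```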
Surface
Open WebUI (RAG mode)
Frontend
8080/tcp
Chat surface. Built-in RAG with hybrid retrieval; multi-user authentication; per-user document scoping. The MS-Teams alternative needs MSAL config.
Runs: Docker container
Data
unstructured.io (open-source)
Ingest
Document loader. Handles PDF / DOCX / HTML / Markdown / images via OCR. The hosted commercial version has more parsers, but the open-source path covers ~95% of typical documents.
Runs: Docker container, batch jobs
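A minimal ingestion sketch with the open-source unstructured library; the file path and chunk size are illustrative, and chunk_by_title is one of several chunkers it ships:

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse one document into typed elements, then chunk along section titles.
elements = partition(filename="docs/handbook.pdf")      # path is illustrative
chunks = chunk_by_title(elements, max_characters=1000)

for chunk in chunks[:3]:
    print(type(chunk).__name__, chunk.text[:80])
```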
Qdrant
Vector DB
6333/tcp (loopback)
Vector DB. Single-binary, payload-filtering native, snapshot/restore built in. pgvector is the alternative if you already operate Postgres.
Runs: Docker container
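A sketch of collection bootstrap with qdrant-client, sized for nomic-embed's 768-dim vectors; the collection name and payload shape are assumptions that the other sketches on this page reuse:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://127.0.0.1:6333")

# 768-dim cosine space matches nomic-embed-text-v1.5.
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Payload carries the source-document key so cleanup jobs can join against MinIO.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.0] * 768,
                        payload={"doc_id": "handbook.pdf", "chunk": 0})],
)
```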
MinIO
Storage
9000/tcp
Object storage for documents. S3-compatible self-hosted blob store. Keeps raw documents separate from the embedding index, so re-ingestion never requires re-acquiring the sources.
Runs: Docker, named volume
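MinIO speaks plain S3, so boto3 works unchanged; the endpoint and the minioadmin credentials below are MinIO's well-known defaults, shown for illustration only (rotate them before production):

```python
import boto3

# Point a standard S3 client at the local MinIO instance.
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.create_bucket(Bucket="raw-docs")
s3.upload_file("docs/handbook.pdf", "raw-docs", "handbook.pdf")
```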
Operations
Caddy
Proxy / TLS
Reverse proxy + TLS. Sane defaults, auto-TLS. Suitable for the small-team / on-prem audience this workflow targets.
Runs: host systemd
Prometheus + Grafana + Qdrant exporter
Observability
Metrics. Qdrant exposes /metrics directly; vLLM exports natively. Watch retrieval latency p99 and rerank latency p99 — they're the user-facing pain.
Runs: Docker compose

Hardware

RTX 4090 is the comfortable single-card target. Apple M3 Max 64 GB is the silent alternative — runs the same models via MLX-LM at ~25-35% lower decode tok/s but draws a fraction of the power.

The vector DB and reranker tax CPU + RAM more than the LLM does. Budget 32 GB RAM minimum for Qdrant + Open WebUI's RAG processing concurrent with the model. 64 GB is the comfortable working number.

NVMe storage is non-negotiable. SATA SSDs choke on large-corpus ingestion (HDD chokes at the first batch).

Storage

Plan storage in three tiers: (1) raw documents in MinIO (1-3× corpus size once parsed derivatives are stored), (2) Qdrant HNSW indices (~3 GB per million chunks at 768 dims in float32, plus graph overhead; int8 quantization cuts the vector portion roughly 4×), (3) snapshot backups (~Qdrant size × N retention generations).

For a 1 GB corpus of typical PDFs: ~3 GB raw (some documents carry heavy images), ~50K chunks, and ~150 MB of float32 vectors in Qdrant before graph overhead. Even a 100 GB corpus stays under 500 GB total once everything's quantized.
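A back-of-envelope estimator for tier 2; the 1.5× HNSW graph-overhead factor is an assumption, so treat the output as a floor rather than a quote:

```python
def qdrant_index_bytes(chunks: int, dims: int = 768,
                       bytes_per_dim: int = 4, hnsw_overhead: float = 1.5) -> float:
    """Rough HNSW index size: raw vector bytes times a graph-overhead factor."""
    return chunks * dims * bytes_per_dim * hnsw_overhead

print(qdrant_index_bytes(50_000) / 2**20)     # ~220 MiB for a 1 GB corpus
print(qdrant_index_bytes(5_000_000) / 2**30)  # ~21 GiB for a 100 GB corpus
```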

Snapshot strategy. Qdrant supports atomic snapshots without downtime. Run a nightly cron that snapshots → uploads to a second storage volume (or to MinIO). Keep 7 daily + 4 weekly + 6 monthly. Total cold storage stays under 100 GB for corpora up to roughly 10 GB.
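A sketch of the nightly job using qdrant-client's snapshot API; only the 7-daily tier of the rotation is shown, and shipping the snapshot file off-box is left to the surrounding cron script:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://127.0.0.1:6333")

# Atomic, no-downtime snapshot of the collection.
snap = client.create_snapshot(collection_name="docs")
print("created", snap.name)

# Keep only the 7 newest snapshots (the daily tier of the 7/4/6 policy).
snaps = sorted(client.list_snapshots("docs"), key=lambda s: s.creation_time or "")
for old in snaps[:-7]:
    client.delete_snapshot(collection_name="docs", snapshot_name=old.name)
```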

Networking

Air-gapped means: no DNS to public resolvers, no NTP to public servers, no auto-updates. Run a local Pi-hole + an internal NTP server.

If users access via a private corporate network: bind Caddy to the internal interface only. If users access via Tailscale: bind to the tailnet interface only. Never to 0.0.0.0.

Inside the workflow: every container talks via Docker bridge networks, no published ports except Caddy (443) and (optionally) Open WebUI debug (8080 loopback).

Observability

Critical metrics:

  • Retrieval latency p99. Qdrant cold-start can take 100 ms+; warm queries are <20 ms. Sustained p99 > 200 ms means the index doesn't fit in RAM (a probe sketch follows this list).
  • Rerank latency p99. bge-reranker-v2-m3 on CPU ~80-150ms for top-10. Sustained > 400ms means the CPU is overcommitted.
  • Generation tok/s. Should stay above 30 tok/s on a 4090 + 14B AWQ; below means concurrent users exceeded capacity.
  • Document ingestion success rate. unstructured.io fails on ~2-5% of typical PDFs (scans, password-protected, mixed RTL). Track and triage manually.
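A minimal loopback probe for the first metric, using the classic search API and a zero vector as a stand-in query (a real probe should replay representative query embeddings):

```python
import statistics
import time

from qdrant_client import QdrantClient

client = QdrantClient(url="http://127.0.0.1:6333")
latencies_ms = []

for _ in range(200):
    t0 = time.perf_counter()
    client.search(collection_name="docs", query_vector=[0.0] * 768, limit=10)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

# 99th percentile over 200 samples; compare against the 200 ms line above.
print(f"retrieval p99 = {statistics.quantiles(latencies_ms, n=100)[98]:.1f} ms")
```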

Security

Document scoping. Open WebUI supports per-user document collections — use them. Never share collections across teams that have different access policies.

Embedding model integrity. Pin the embeddings model SHA. A swapped embedding model breaks every existing query in subtle ways and is a supply-chain attack vector.
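A sketch of the pin check; the GGUF path is illustrative and the pinned digest is a placeholder you record once at deploy time:

```python
import hashlib

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

# Placeholder: record the real digest at deploy time and keep it in version control.
PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

actual = sha256_of("models/nomic-embed-text-v1.5.f16.gguf")  # path is illustrative
assert actual == PINNED_SHA256, f"embedding model changed: {actual}"
```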

Audit trail. Open WebUI logs every query + retrieval; pipe to Loki + retain 90 days for compliance audits.

Backup encryption. Qdrant snapshots contain the full text of indexed documents. Encrypt at rest with age or gpg before shipping to off-site.

Upgrade path

Tighter retrieval (more accurate citations): swap nomic-embed → e5-mistral-7b-instruct (much larger, ~10 GB VRAM) for top-tier MTEB scores. Or stack: keep nomic-embed for speed, run e5-mistral as a secondary embedder for re-ranking via vector similarity.

Larger corpus (>10 GB documents): move from single Qdrant node → 3-node Qdrant cluster on shared NVMe. Adds operational complexity but stays self-hosted.

Multi-tenant production: add per-tenant Qdrant collections, audit logging via Loki + Vector, an API gateway (Kong) in front of Open WebUI for SSO.

What breaks first

  1. OCR coverage. Scanned PDFs hit unstructured.io's OCR fallback (Tesseract), which works but is slow and error-prone. Either pre-OCR with a better tool (Surya, Textract) or live with the degradation.
  2. Document drift. Re-ingesting changed documents leaves orphan vectors behind. Run a periodic "find vectors with no matching document" cleanup (see the sketch after this list).
  3. Reranker bottleneck. bge-reranker on CPU caps at ~10 reranks/sec. At 15 concurrent users you'll start queueing; move reranker to GPU when this happens.
  4. Open WebUI version drift. 0.x → 0.y minor bumps occasionally break the RAG pipeline. Pin the image SHA.
  5. Snapshot rotation forgotten. Eventually fills the disk and Qdrant goes read-only mid-day. Set up disk-usage alerts in Grafana.
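A sketch of that cleanup, reusing the doc_id payload key and bucket name assumed in the service-ledger sketches above:

```python
import boto3
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, FilterSelector, MatchAny

qdrant = QdrantClient(url="http://127.0.0.1:6333")
s3 = boto3.client("s3", endpoint_url="http://127.0.0.1:9000",
                  aws_access_key_id="minioadmin", aws_secret_access_key="minioadmin")

# 1. Every doc_id referenced by vectors in the index.
indexed, offset = set(), None
while True:
    points, offset = qdrant.scroll("docs", limit=1000, offset=offset, with_payload=True)
    indexed.update(p.payload["doc_id"] for p in points)
    if offset is None:
        break

# 2. Every object that still exists in the raw-docs bucket.
live = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="raw-docs"):
    live.update(obj["Key"] for obj in page.get("Contents", []))

# 3. Drop vectors whose source document is gone.
orphans = indexed - live
if orphans:
    qdrant.delete(collection_name="docs", points_selector=FilterSelector(
        filter=Filter(must=[FieldCondition(key="doc_id",
                                           match=MatchAny(any=sorted(orphans)))])))
```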

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

  • /stacks/offline-rag-workstation →
  • /stacks/memory-enabled-agent →
Map this workflow to a build

Open the custom build engine and explore which hardware tier actually supports this workflow.

Open custom builder →

Workflow validation

Status: unvalidated

Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.

  • Unvalidated: qwen-2.5-14b-instruct via vLLM. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks · Submit the first benchmark →
Validate this workflow → · See benchmark roadmap → · How validation works →