Edge / air-gapped · Week build-out

Offline RAG pipeline

Air-gapped retrieval-augmented chat: document ingestion via unstructured.io, nomic-embed embeddings into Qdrant, bge-reranker reranking, Qwen 2.5 14B-Instruct generation, and Open WebUI as the chat surface, with observability and nightly snapshot backups.

By Fredoline Eruo · Reviewed 2026-05-07 · ~2,100 words

Build summary

Hardware footprint
RTX 4090, Apple M3 Max 64 GB, or dual RTX 3090 · 64 GB RAM · 2 TB NVMe
Concurrency
5-15 concurrent users on a single 4090; more requires replicas.
Power
Sustained 350-450 W on RTX; ~120 W on Apple M3 Max.

Goal: Ship a private Q&A system over a corpus of internal documents that never leaves the network.

Operator card

Workflow
Best for
  • ✓Compliance-heavy teams that can't ship documents to cloud LLMs
  • ✓Internal knowledge-base Q&A on private corpora
  • ✓Regulated industries (legal, healthcare, finance)
  • ✓Single-team RAG at 5-15 concurrent users
Avoid if
  • ⚠You need >50 concurrent users (move to multi-replica or cloud)
  • ⚠Your corpus is mostly low-quality scanned PDFs (OCR pre-step required)
  • ⚠You don't have an ops person who can run Docker + Prometheus
Stability
Battle-tested
Maintenance
Weekly attention
Skill
Advanced
Long-session reliability
Reliable

Service ledger

10 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
nomic-embed-text-v1.5
Embeddings
Embeddings model. Open-weights bi-encoder; MTEB-competitive; 137M params; runs comfortably on the same GPU alongside the LLM.
Runs: llama.cpp server
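A minimal smoke test for this service, assuming llama.cpp's llama-server was started with its embedding mode enabled and its OpenAI-compatible API bound to loopback port 8081 (port and GGUF filename are illustrative, and the exact flag name varies slightly across llama.cpp versions):

```python
import requests

# Query llama-server's OpenAI-compatible embeddings endpoint.
# Assumes something like: llama-server --embedding -m nomic-embed-text-v1.5.f16.gguf --port 8081
resp = requests.post(
    "http://127.0.0.1:8081/v1/embeddings",
    json={"input": ["search_query: test sentence"], "model": "nomic-embed-text-v1.5"},
    timeout=10,
)
resp.raise_for_status()
vec = resp.json()["data"][0]["embedding"]
print(len(vec))  # expect 768 dimensions for nomic-embed-text-v1.5
```

Note that nomic-embed-text-v1.5 expects task prefixes ("search_document: " for corpus chunks, "search_query: " for queries); omit them and retrieval quality quietly degrades.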
bge-reranker-v2-m3
Reranker
Cross-encoder reranker. Top-10 → top-5 rerank lifts grounded-generation quality dramatically; cheap enough to run on CPU.
Runs: FastAPI sidecar
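A sketch of that sidecar, assuming sentence-transformers' CrossEncoder hosts the model (it can load bge-reranker-v2-m3 as a sequence-classification cross-encoder); the endpoint name and request shape are illustrative:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import CrossEncoder

app = FastAPI()
model = CrossEncoder("BAAI/bge-reranker-v2-m3")  # CPU by default

class RerankRequest(BaseModel):
    query: str
    passages: list[str]
    top_k: int = 5

@app.post("/rerank")
def rerank(req: RerankRequest):
    # Score every (query, passage) pair, then keep the top_k passages.
    scores = model.predict([(req.query, p) for p in req.passages])
    ranked = sorted(zip(req.passages, scores), key=lambda x: -x[1])
    return [{"passage": p, "score": float(s)} for p, s in ranked[: req.top_k]]
```

Run it with uvicorn; when the CPU rerank queue backs up (see "What breaks first"), the same code moves to GPU by installing a CUDA build of torch.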
vLLM
Inference
8000/tcp
Inference engine. Concurrent batching scales to 5-15 users on a single 4090. Ollama works for solo but caps out at one stream.
Runs: Docker container, GPU 0
Qwen 2.5 14B-Instruct (AWQ)
Model
Generator LLM. Long-context (128K), strong instruction-following, fits 24 GB with 32K window comfortably. Llama 3.1 8B is the alternative when latency matters more than reasoning depth.
Runs: vLLM
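A sketch of a client call against vLLM's OpenAI-compatible endpoint on 8000/tcp; the HF model id and prompt scaffold are illustrative, and no real API key is needed on a loopback deployment:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; "EMPTY" is the conventional local placeholder.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",  # must match the model vLLM was launched with
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "CONTEXT:\n...\n\nQUESTION: ..."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(resp.choices[0].message.content)
```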
Surface
Open WebUI (RAG mode)
Frontend
8080/tcp
Chat surface. Built-in RAG with hybrid retrieval; multi-user authentication; per-user document scoping. The MS-Teams alternative needs MSAL config.
Runs: Docker container
Data
unstructured.io (open-source)
Ingest
Document loader. Handles PDF / DOCX / HTML / Markdown / images via OCR. The hosted commercial version has more parsers, but the open-source path covers ~95% of typical documents.
Runs: Docker container, batch jobs
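A minimal ingestion sketch with the open-source unstructured library; the file path and chunk size are illustrative, and chunk_by_title is one of several chunkers it ships:

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse one document into typed elements, then chunk along section titles.
elements = partition(filename="docs/handbook.pdf")      # path is illustrative
chunks = chunk_by_title(elements, max_characters=1000)

for chunk in chunks[:3]:
    print(type(chunk).__name__, chunk.text[:80])
```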
Qdrant
Vector DB
6333/tcp (loopback)
Vector DB. Single-binary, payload-filtering native, snapshot/restore built in. pgvector is the alternative if you already operate Postgres.
Runs: Docker container
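A sketch of collection bootstrap with qdrant-client, sized for nomic-embed's 768-dim vectors; the collection name and payload shape are assumptions that the other sketches on this page reuse:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://127.0.0.1:6333")

# 768-dim cosine space matches nomic-embed-text-v1.5.
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Payload carries the source-document key so cleanup jobs can join against MinIO.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.0] * 768,
                        payload={"doc_id": "handbook.pdf", "chunk": 0})],
)
```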
MinIO
Storage
9000/tcp
Object storage for documents. S3-compatible self-hosted blob store. Keeps raw documents separate from the embedding index, so re-ingestion never requires re-acquiring the sources.
Runs: Docker, named volume
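MinIO speaks plain S3, so boto3 works unchanged; the endpoint and the minioadmin credentials below are MinIO's well-known defaults, shown for illustration only (rotate them before production):

```python
import boto3

# Point a standard S3 client at the local MinIO instance.
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.create_bucket(Bucket="raw-docs")
s3.upload_file("docs/handbook.pdf", "raw-docs", "handbook.pdf")
```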
Operations
Caddy
Proxy / TLS
Reverse proxy + TLS. Sane defaults, auto-TLS. Suitable for the small-team / on-prem audience this workflow targets.
Runs: host systemd
Prometheus + Grafana + Qdrant exporter
Observability
Metrics. Qdrant exposes /metrics directly; vLLM exports natively. Watch retrieval latency p99 and rerank latency p99 — they're the user-facing pain.
Runs: Docker compose

Hardware

RTX 4090 is the comfortable single-card target. Apple M3 Max 64 GB is the silent alternative — runs the same models via MLX-LM at ~25-35% lower decode tok/s but draws a fraction of the power.

The vector DB and reranker tax CPU + RAM more than the LLM does. Budget 32 GB RAM minimum for Qdrant + Open WebUI's RAG processing concurrent with the model. 64 GB is the comfortable working number.

NVMe storage is non-negotiable. SATA SSDs choke on large-corpus ingestion (HDD chokes at the first batch).

Storage

Plan storage in three tiers: (1) raw documents in MinIO (1-3× corpus size once parsed derivatives are stored), (2) Qdrant HNSW indices (~3 GB per million chunks at 768 dims in float32, plus graph overhead; int8 quantization cuts the vector portion roughly 4×), (3) snapshot backups (~Qdrant size × N retention generations).

For a 1 GB corpus of typical PDFs: ~3 GB raw (some documents carry heavy images), ~50K chunks, and ~150 MB of float32 vectors in Qdrant before graph overhead. Even a 100 GB corpus stays under 500 GB total once everything's quantized.
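A back-of-envelope estimator for tier 2; the 1.5× HNSW graph-overhead factor is an assumption, so treat the output as a floor rather than a quote:

```python
def qdrant_index_bytes(chunks: int, dims: int = 768,
                       bytes_per_dim: int = 4, hnsw_overhead: float = 1.5) -> float:
    """Rough HNSW index size: raw vector bytes times a graph-overhead factor."""
    return chunks * dims * bytes_per_dim * hnsw_overhead

print(qdrant_index_bytes(50_000) / 2**20)     # ~220 MiB for a 1 GB corpus
print(qdrant_index_bytes(5_000_000) / 2**30)  # ~21 GiB for a 100 GB corpus
```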

Snapshot strategy. Qdrant supports atomic snapshots without downtime. Run a nightly cron that snapshots → uploads to a second storage volume (or to MinIO). Keep 7 daily + 4 weekly + 6 monthly. Total cold storage stays under 100 GB for corpora up to roughly 10 GB.
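A sketch of the nightly job using qdrant-client's snapshot API; only the 7-daily tier of the rotation is shown, and shipping the snapshot file off-box is left to the surrounding cron script:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://127.0.0.1:6333")

# Atomic, no-downtime snapshot of the collection.
snap = client.create_snapshot(collection_name="docs")
print("created", snap.name)

# Keep only the 7 newest snapshots (the daily tier of the 7/4/6 policy).
snaps = sorted(client.list_snapshots("docs"), key=lambda s: s.creation_time or "")
for old in snaps[:-7]:
    client.delete_snapshot(collection_name="docs", snapshot_name=old.name)
```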

Networking

Air-gapped means: no DNS to public resolvers, no NTP to public servers, no auto-updates. Run a local Pi-hole + an internal NTP server.

If users access via a private corporate network: bind Caddy to the internal interface only. If users access via Tailscale: bind to the tailnet interface only. Never to 0.0.0.0.

Inside the workflow: every container talks via Docker bridge networks, no published ports except Caddy (443) and (optionally) Open WebUI debug (8080 loopback).

Observability

Critical metrics:

  • Retrieval latency p99. Qdrant cold-start can take 100 ms+; warm queries are <20 ms. Sustained p99 > 200 ms means the index doesn't fit in RAM (a probe sketch follows this list).
  • Rerank latency p99. bge-reranker-v2-m3 on CPU ~80-150ms for top-10. Sustained > 400ms means the CPU is overcommitted.
  • Generation tok/s. Should stay above 30 tok/s on a 4090 + 14B AWQ; below means concurrent users exceeded capacity.
  • Document ingestion success rate. unstructured.io fails on ~2-5% of typical PDFs (scans, password-protected, mixed RTL). Track and triage manually.
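A minimal loopback probe for the first metric, using the classic search API and a zero vector as a stand-in query (a real probe should replay representative query embeddings):

```python
import statistics
import time

from qdrant_client import QdrantClient

client = QdrantClient(url="http://127.0.0.1:6333")
latencies_ms = []

for _ in range(200):
    t0 = time.perf_counter()
    client.search(collection_name="docs", query_vector=[0.0] * 768, limit=10)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

# 99th percentile over 200 samples; compare against the 200 ms line above.
print(f"retrieval p99 = {statistics.quantiles(latencies_ms, n=100)[98]:.1f} ms")
```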

Security

Document scoping. Open WebUI supports per-user document collections — use them. Never share collections across teams that have different access policies.

Embedding model integrity. Pin the embeddings model SHA. A swapped embedding model breaks every existing query in subtle ways and is a supply-chain attack vector.
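A sketch of the pin check; the GGUF path is illustrative and the pinned digest is a placeholder you record once at deploy time:

```python
import hashlib

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

# Placeholder: record the real digest at deploy time and keep it in version control.
PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

actual = sha256_of("models/nomic-embed-text-v1.5.f16.gguf")  # path is illustrative
assert actual == PINNED_SHA256, f"embedding model changed: {actual}"
```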

Audit trail. Open WebUI logs every query + retrieval; pipe to Loki + retain 90 days for compliance audits.

Backup encryption. Qdrant snapshots contain the full text of indexed documents. Encrypt at rest with age or gpg before shipping to off-site.

Upgrade path

Tighter retrieval (more accurate citations): swap nomic-embed → e5-mistral-7b-instruct (much larger, ~10 GB VRAM) for top-tier MTEB scores. Or stack: keep nomic-embed for speed, run e5-mistral as a secondary embedder for re-ranking via vector similarity.

Larger corpus (>10 GB documents): move from single Qdrant node → 3-node Qdrant cluster on shared NVMe. Adds operational complexity but stays self-hosted.

Multi-tenant production: add per-tenant Qdrant collections, audit logging via Loki + Vector, an API gateway (Kong) in front of Open WebUI for SSO.

What breaks first

  1. OCR coverage. Scanned PDFs hit unstructured.io's OCR fallback (Tesseract), which works but is slow and error-prone. Either pre-OCR with a better tool (Surya, Textract) or live with the degradation.
  2. Document drift. Re-ingesting changed documents leaves orphan vectors behind. Run a periodic "find vectors with no matching document" cleanup (see the sketch after this list).
  3. Reranker bottleneck. bge-reranker on CPU caps at ~10 reranks/sec. At 15 concurrent users you'll start queueing; move reranker to GPU when this happens.
  4. Open WebUI version drift. 0.x → 0.y minor bumps occasionally break the RAG pipeline. Pin the image SHA.
  5. Snapshot rotation forgotten. Eventually fills the disk and Qdrant goes read-only mid-day. Set up disk-usage alerts in Grafana.
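A sketch of that cleanup, reusing the doc_id payload key and bucket name assumed in the service-ledger sketches above:

```python
import boto3
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, FilterSelector, MatchAny

qdrant = QdrantClient(url="http://127.0.0.1:6333")
s3 = boto3.client("s3", endpoint_url="http://127.0.0.1:9000",
                  aws_access_key_id="minioadmin", aws_secret_access_key="minioadmin")

# 1. Every doc_id referenced by vectors in the index.
indexed, offset = set(), None
while True:
    points, offset = qdrant.scroll("docs", limit=1000, offset=offset, with_payload=True)
    indexed.update(p.payload["doc_id"] for p in points)
    if offset is None:
        break

# 2. Every object that still exists in the raw-docs bucket.
live = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="raw-docs"):
    live.update(obj["Key"] for obj in page.get("Contents", []))

# 3. Drop vectors whose source document is gone.
orphans = indexed - live
if orphans:
    qdrant.delete(collection_name="docs", points_selector=FilterSelector(
        filter=Filter(must=[FieldCondition(key="doc_id",
                                           match=MatchAny(any=sorted(orphans)))])))
```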

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

  • /stacks/offline-rag-workstation →
  • /stacks/memory-enabled-agent →
Map this workflow to a build

Open the custom build engine and explore which hardware tier actually supports this workflow.

Open custom builder →

Workflow validation

Status: unvalidated

Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.

  • Unvalidated: qwen-2.5-14b-instruct via vLLM. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks · Submit the first benchmark →
Validate this workflow → · See benchmark roadmap → · How validation works →