Build an RTX 4090 AI workstation stack (May 2026)
A general-purpose AI workstation built around a single RTX 4090 24GB — runs a 32B-class coding model, a 14B chat model, and serves agent workloads to a small team on the same box.
- 01 · Hardware · GPU (the hardware that defines this stack): rtx-4090
24GB VRAM is the first-class consumer tier in May 2026 — 4080 16GB doesn't have headroom for 32K context on 32B models; 5090 helps but is 2-3x the price for ~30% more throughput. The 4090 stays the sweet spot until 5090 supply normalises.
- 02 · Tool · Inference engine (production-grade serving): vllm
vLLM over Ollama for the production-serving role on this box — continuous batching matters when 3-5 users hit the same model concurrently, and the OpenAI-compatible endpoint makes Open WebUI / AnythingLLM / OpenHands plug in without adapter code. Keep Ollama installed alongside for ad-hoc model swaps.
- 03 · Tool · Model-swap layer (ad-hoc experiments): ollama
Ollama lives next to vLLM, not as competition: it owns the 'I want to try a new model right now' surface. One-line model pulls beat re-rendering vLLM Docker configs every time. Run on a different port (11434) to avoid clashes.
- 04 · Model · Coding model (32B class): qwen-2.5-coder-32b-instruct
Qwen 2.5 Coder 32B AWQ-INT4 is the strongest model that fits 24GB with real context room — beats DeepSeek Coder V2 Lite on coding benchmarks at the same VRAM budget. Reserve 8-10GB of VRAM for KV cache; 32K context is the sweet spot.
- 05 · Model · Chat model (low-latency general-purpose): qwen-3-14b
Qwen 3 14B at FP16 fits with massive headroom; serves chat, summaries, and tool-call workloads at 60+ tok/s with single-digit-ms TTFT on warm prefix. The right default when you don't need coding-class reasoning.
- 06 · Tool · Team chat frontend: openwebui
Open WebUI over AnythingLLM for the chat-frontend role on a workstation: better multi-user ergonomics, cleaner pipelines for tool calls. AnythingLLM wins for RAG-first workspaces; Open WebUI wins when you want a polished chat UI for a small team.
- 07 · Tool · RAG workspace frontend: anythingllm
Pairs with Open WebUI on the same box — different roles. AnythingLLM owns the 'chat with my documents' workflow; Open WebUI owns 'chat with the model directly.' Each runs as its own Docker container and points at the same vLLM endpoint.
Why this stack on this hardware
The 4090's 24GB VRAM creates a specific architectural window. 16GB cards cannot run a 32B-class coding model with real context room; 48GB+ cards (L40S / 6000 Ada / 5090 paired) shift you into a different cost tier. The 4090 sits in the middle and rewards a stack that respects both constraints — fits the model AND keeps headroom for batch serving across multiple users.
The headline architectural choice this stack makes: vLLM and Ollama coexist on the same machine, serving different workflows. Most guides treat them as competitors; on a 24GB workstation they're complementary — vLLM owns the “production endpoint we serve to frontends” role; Ollama owns the “I just want to try this model right now” role. Different ports, different lifecycle expectations, different model rotation rates.
Step-by-step setup
1. Bring up vLLM as the production endpoint
# Run vLLM on port 8000 — the production-facing endpoint
docker run --gpus all -d --name vllm \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
vllm/vllm-openai:v0.17.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill

Note: --gpu-memory-utilization 0.85 rather than 0.9 — leaving 15% (3.6GB) headroom for Ollama to coexist on the same card. Ollama doesn't pre-reserve VRAM the way vLLM does, so the same allocation that worked at 0.9 in a single-runtime setup will OOM the moment Ollama loads something.
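Once the container is up, a quick smoke test against the OpenAI-compatible route confirms the model is actually serving. The prompt below is an arbitrary example; the model name must match the --model value passed above.

# Smoke-test the production endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
        "max_tokens": 64
      }'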
2. Add Ollama on a different port for ad-hoc work
# Ollama on its default port (11434) — no Docker, native install
curl -fsSL https://ollama.com/install.sh | sh
# Pull a chat model that's distinct from the coding model in vLLM
ollama pull qwen3:14b
# Verify both runtimes are alive on different ports
curl http://localhost:8000/v1/models # vLLM (Qwen Coder 32B)
curl http://localhost:11434/api/tags   # Ollama (Qwen 3 14B)

Both can run concurrently because they hold different model weights — vLLM has Qwen Coder loaded; Ollama loads Qwen 3 14B on demand and unloads it when idle. Total VRAM under load: ~22GB; idle Ollama drops to ~12GB.
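To confirm the Ollama side generates (and to watch it load and then release the weights), a one-shot request against its native API works; the prompt is just a placeholder.

# One-shot generation through Ollama's native API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "Summarise what continuous batching does in one sentence.",
  "stream": false
}'
# Watch VRAM while it runs; Ollama frees the model again after its keep-alive window
nvidia-smi --query-gpu=memory.used --format=csv -l 1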
3. Wire Open WebUI as the team frontend
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
-e ENABLE_OLLAMA_API=true \
-e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
ghcr.io/open-webui/open-webui:latest

Open WebUI sees both endpoints and presents them as model options. Users pick “Qwen Coder 32B (vLLM)” for coding tasks and “Qwen 3 14B (Ollama)” for chat. Same UI; different runtimes; the model switcher is transparent.
4. Add AnythingLLM for the RAG-workspace surface
docker run -d --name anythingllm \
-p 3001:3001 \
--restart unless-stopped \
--cap-add SYS_ADMIN \
-v anythingllm-storage:/app/server/storage \
-e LLM_PROVIDER="generic-openai" \
-e GENERIC_OPEN_AI_BASE_PATH="http://host.docker.internal:8000/v1" \
-e GENERIC_OPEN_AI_MODEL_PREF="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ" \
mintplexlabs/anythingllm

AnythingLLM gets the same vLLM endpoint as Open WebUI — they share the model, isolate the workspace. Different role: Open WebUI for direct chat, AnythingLLM for “chat with my documents.” Both alive on different ports (3000 and 3001).
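Before handing links to the team, a quick check that both frontends answer on their mapped ports:

# Both frontends should return an HTTP status line on their host ports
curl -sI http://localhost:3000 | head -n 1    # Open WebUI
curl -sI http://localhost:3001 | head -n 1    # AnythingLLM
docker ps --format '{{.Names}}: {{.Ports}}'   # confirm container names and port mappings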
OS-level tuning that actually matters
The configuration that affects throughput on this stack (and the configuration that doesn't):
- NVIDIA driver >= 555. Older drivers have FlashAttention-2 kernel selection bugs that silently halve vLLM throughput. Run nvidia-smi --query-gpu=driver_version --format=csv to verify.
- nvidia-persistenced running. Without it, the GPU re-initializes on every CUDA context create, adding 100-300ms to first-token latency on cold starts. Enable with sudo systemctl enable --now nvidia-persistenced.
- NVMe scheduler set to none. The default mq-deadline scheduler adds latency on model load. echo none | sudo tee /sys/block/nvme0n1/queue/scheduler, or pin it via a rule in /etc/udev/rules.d/.
- System RAM 64GB minimum. The OS file cache holds the model weights between vLLM cold starts; 32GB systems re-read from disk on every restart, adding 20-40 seconds to startup.
- Power limit set to 350W for thermal sustainability under continuous load. The 4090's 450W TDP is fine for bursts but reduces card lifetime in sustained inference. nvidia-smi -pl 350 sets it; pin it in a systemd service to make it persistent (a minimal unit sketch follows this list).
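To make the 350W power limit survive reboots, a small oneshot systemd unit is one option. This is a minimal sketch; the unit name nvidia-power-limit.service is illustrative, not anything NVIDIA ships, and paths may differ by distro.

# /etc/systemd/system/nvidia-power-limit.service  (illustrative unit name)
[Unit]
Description=Pin the RTX 4090 power limit to 350W at boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 350

[Install]
WantedBy=multi-user.target

# Enable it once the file is in place
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-power-limit.service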
What does NOT meaningfully matter on this stack: PCIe bifurcation (single-card workloads don't care), RAM frequency above DDR5-5600 (CPU-side bandwidth isn't the bottleneck), CPU core count above 8 (vLLM threading is GPU-bound).
Failure modes you'll hit
- OOM when both vLLM and Ollama load. vLLM at --gpu-memory-utilization 0.9 + Ollama loading a 14B model = OOM. Drop vLLM to 0.85 (recommended above) or run Ollama with OLLAMA_KEEP_ALIVE=0 so it unloads aggressively.
- Open WebUI can't see vLLM models. host.docker.internal doesn't resolve on Linux by default. Either run with --add-host=host.docker.internal:host-gateway or use host network mode (--network=host); a patched run command is sketched after this list.
- Coil whine on light load. 4090s sing audibly under low-utilization GPU load. Power-limiting to 350W usually cures it; if it doesn't, the card is within spec (NVIDIA's position) but you may want to RMA. Most stack-builders accept it as the price of consumer-tier hardware.
- Thermal throttling at 30+ minutes of sustained load. Stock 4090 cooling handles bursts, but a tight chassis with one 120mm exhaust will hit 87°C and throttle. Verify with nvidia-smi --query-gpu=temperature.gpu --format=csv -l 1 during a long generation; add chassis fans or undervolt if it climbs past 80°C.
- Open WebUI persistent-volume corruption. Killing the container during a write can corrupt the SQLite db. Mitigate with --restart unless-stopped (above) and an explicit volume backup before any docker rm.
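For the host.docker.internal failure above, the smallest fix on Linux is to recreate the frontend container with the extra host mapping. A sketch of the step-3 command with only the added --add-host flag; the same flag applies to the AnythingLLM container.

# Recreate Open WebUI with host.docker.internal mapped to the host gateway (Linux)
docker rm -f open-webui
docker run -d --name open-webui \
  -p 3000:8080 \
  --restart unless-stopped \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEYS="any-string" \
  -e ENABLE_OLLAMA_API=true \
  -e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
  ghcr.io/open-webui/open-webui:latest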
Variations and alternatives
5090 swap. If you have a 5090, the architectural shape doesn't change — same vLLM + Ollama + frontends pattern. You get ~30% more throughput and FP4 support; not enough to justify the price jump on its own, but if you already own one, no reconfiguration needed.
Multi-GPU 4090 variation. 2x 4090 with NVLink isn't a thing on consumer SKUs (NVIDIA disabled NVLink on Ada consumer); 2x over PCIe loses 30-40% of throughput to interconnect bandwidth. Usually not worth it unless the model genuinely won't fit. See /systems/distributed-inference for the math.
SGLang variation. If your team workflow is heavy on agent loops with stable system prompts (10+ tool calls per task on a fixed prefix), SGLang can replace vLLM for 1.3-1.7x aggregate throughput. The frontends and Ollama side stay the same.
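A rough sketch of that swap, assuming SGLang is installed and serving the same model on the same port so the frontends keep working unchanged; flag names beyond --model-path and --port can vary between SGLang versions.

# Stop the vLLM container, then serve the same model with SGLang on port 8000
docker rm -f vllm
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --port 8000 \
  --mem-fraction-static 0.85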
Linux vs Windows-WSL2. Both work. Linux is ~5% faster on raw throughput due to lower CUDA driver overhead and better filesystem performance for the model cache. WSL2 catches up on most workloads; the only place it regresses meaningfully is rapid model-swap workflows where the file cache matters most.
Going deeper
- RTX 4090 catalog entry — VRAM math, thermal characteristics, the long-tail of quirks and how to tune them out.
- vLLM operational review — the runtime-specific operator detail behind the production endpoint pick.
- Inference runtime ecosystem map — full landscape of what could replace vLLM or Ollama in this stack.
- Local coding-agent stack — the specialised workstation variation focused on coding.