Build a local coding-agent stack (May 2026)
A coding agent that drafts diffs, runs tests, and edits files autonomously — entirely on your hardware, with persistent memory of the codebase.
- 01 · Tool · Coding agent (the planning + execution loop): openhands
OpenHands v1.6 ships Planning Mode (drafts a plan before execution) and has the longest production track record in the open-source category. Pick OpenHands over Aider when you want autonomous task execution; pick Aider for surgical git-integrated edits.
- 02 · Model · Coding model (the actual brain): qwen-2.5-coder-32b-instruct
Qwen 2.5 Coder 32B Instruct is the strongest open coding model in the 32B class as of May 2026 — beats DeepSeek Coder V2 Lite on HumanEval+ and SWE-Bench Lite at the same VRAM footprint. AWQ-INT4 fits on a 24GB card with headroom for a 32K context window.
- 03 · Tool · Inference engine (production-grade serving): vllm
vLLM over Ollama for this stack: continuous batching means an agent making 5-10 concurrent tool calls per task doesn't queue, prefix caching keeps the system prompt resident across iterations, and the OpenAI-compatible API plugs into OpenHands with zero adapter code. Use Ollama only for single-user laptop chat.
- 04 · Tool · File access (the agent's hands on the codebase): mcp-server-filesystem
The Anthropic reference filesystem MCP server with strict directory allowlisting. Required for OpenHands to read and write project files; allowlist limits blast radius when the agent goes off the rails.
- 05 · Tool · Repository state (status, diff, blame, history): mcp-server-git
Pairs with mcp-server-filesystem to give the agent full repo awareness — read-side operations only by default. Lets OpenHands reason about what changed and why before proposing new edits.
- 06 · Tool · Persistent memory (codebase context across sessions): mem0
Mem0 over Letta or Zep for this stack: dropping a memory layer into OpenHands takes 20 lines of config; Letta's OS-style explicit memory management is overkill for a single-user coding agent; Zep's temporal knowledge graph is strong but slower to wire. A minimal usage sketch follows this component list.
- 07 · Hardware · GPU (where the model runs): rtx-4090
RTX 4090 24GB is the sweet spot for this stack: enough VRAM for Qwen 32B AWQ-INT4 + 32K context, enough memory bandwidth (1 TB/s) for sub-second TTFT, and consumer-grade thermals. The 5090 helps but isn't required; the 4080 16GB doesn't have headroom for the context window the agent actually needs.
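What the memory layer looks like in practice: a minimal standalone sketch of the Mem0 Python API (pip install mem0ai), separate from the OpenHands wiring shown in step 3 below. The default Memory() config, the repo name, and the stored fact are illustrative assumptions; a fully local setup supplies its own LLM, embedder, and vector store.
# Standalone sketch of the Mem0 memory layer; OpenHands wires this in through
# the [memory] section of config.toml instead (step 3 below).
from mem0 import Memory

memory = Memory()  # default config; point it at local models for a fully offline setup

# Record something the agent learned this session (repo name and fact are illustrative)
memory.add(
    "auth_token expiry validation lives in src/auth/tokens.py",
    user_id="myrepo",
)

# Next session: retrieve relevant context before planning
print(memory.search("where is token expiry validated?", user_id="myrepo"))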
Step-by-step setup
The four steps that take this stack from zero to a working agent. Run them in order on a Linux box with CUDA 12.x already installed; on macOS, swap the GPU step for MLX-LM (see the Apple Silicon variation below).
1. Bring up vLLM with the coding model
# Pull the AWQ-INT4 quant — fits a 24GB card with 32K context
docker run --gpus all --rm -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.17.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--enable-chunked-prefill
The --enable-chunked-prefill flag is non-optional — long-context prefills (the agent will routinely read 1000+ line files) will otherwise stall every other request for 1-3 seconds. gpu-memory-utilization=0.9 leaves ~2GB of VRAM headroom; lower it to 0.85 if you OOM on first inference.
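Before wiring OpenHands, confirm the endpoint actually answers. A minimal sketch using the openai Python client (pip install openai) against vLLM's OpenAI-compatible API; the model name must match whatever --model you passed above.
# Sanity check: the vLLM OpenAI-compatible endpoint answers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="anything")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # must match the --model flag above
    messages=[{"role": "user", "content": "Write a one-line Python hello world."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)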
2. Install the MCP servers
# Filesystem — strict allowlist limits agent blast radius
npx -y @modelcontextprotocol/server-filesystem ~/projects/myrepo
# Git — read-side repo metadata
npx -y @modelcontextprotocol/server-git --repository ~/projects/myrepo
Both run as stdio MCP servers — OpenHands launches them on demand and tears them down between sessions. Pin the allowlisted directory to one repo at a time; never point the filesystem MCP server at ~/ or your blast radius is your entire home directory.
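If you want to verify the filesystem server starts and exposes its tools before handing it to the agent, here is an optional standalone sketch using the MCP Python SDK (pip install mcp); OpenHands performs this stdio handshake itself at session start. The repo path is the same placeholder used above, and the expected tool names are from the reference server.
# Optional: confirm the filesystem MCP server starts and lists its tools.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "/home/you/projects/myrepo"],
)

async def main():
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Expect names like read_file, write_file, list_directory
            print([tool.name for tool in tools.tools])

asyncio.run(main())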
3. Wire OpenHands to the stack
# config.toml
[llm]
model = "openai/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ"
api_base = "http://localhost:8000/v1"
api_key = "anything" # vLLM doesn't check it
[mcp]
servers = [
{ command = "npx", args = ["-y", "@modelcontextprotocol/server-filesystem", "/home/you/projects/myrepo"] },
{ command = "npx", args = ["-y", "@modelcontextprotocol/server-git", "--repository", "/home/you/projects/myrepo"] }
]
[memory]
provider = "mem0"
config = { api_key = "local", host = "http://localhost:11434" }
4. Run a real task
# Drop OpenHands into Planning Mode for the first run
openhands run --plan-first \
--task "Find the bug causing the auth_token validation to fail \
on expired tokens and write a regression test"
The agent should: read the relevant files via filesystem MCP, examine recent commits via git MCP, draft a plan (with Planning Mode this is shown to you for approval), then make the edit and run the test suite. End-to-end on a real bugfix: 60-180 seconds. If your first run takes 10+ minutes, you have a configuration problem — see Failure Modes below.
Failure modes you'll hit
The list of things that go wrong with this stack, in rough order of how often we've seen them:
- vLLM OOM on first inference (not on load). The model loaded fine but the first request crashes. Lower --gpu-memory-utilization from 0.9 to 0.85, or drop --max-model-len from 32768 to 16384.
- Agent loops on plan revision. OpenHands keeps re-planning instead of executing — usually means the model isn't getting a clear enough “ok, plan approved, execute” signal. With Planning Mode, this is fixed by explicitly approving the plan in the UI; in headless mode, set plan_first = false after the first session.
- Filesystem MCP path-escape attempt. The allowlist is enforced; the symptom is the agent reporting “permission denied” on files outside your repo. That's correct behaviour. If you need a wider scope, widen the allowlist deliberately rather than disabling it.
- Mem0 retrieves stale codebase context. The memory layer learned the codebase as it was 3 weeks ago; the agent now reasons against stale knowledge. Re-ingest after major refactors with mem0 reindex --workspace myrepo.
- vLLM prefix cache invalidation on every request. If your TTFT is 200-500ms instead of <50ms after the first call, your system prompt is templating variable user data. Move the variable parts to the user message; the system prompt should be byte-identical across the agent loop (see the sketch after this list).
- Test suite hangs the agent indefinitely. Long-running tests (integration suites that boot a database) blow past OpenHands' default tool-call timeout silently. Set per-tool timeouts in the MCP config or wrap your test runner in a hard deadline (a minimal wrapper follows the prefix-cache sketch below).
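To make the prefix-cache failure mode concrete, here is a minimal sketch of cache-friendly prompting against the vLLM endpoint: the system prompt stays byte-identical and everything variable lives in the user message. The prompt text and the agent_step helper are illustrative, not OpenHands internals.
# Prefix-cache-friendly prompting: the system prompt is byte-identical on every
# call, so vLLM reuses its KV cache across iterations of the agent loop.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="anything")

SYSTEM = "You are a coding agent. Follow the approved plan, edit files, run tests."  # never templated

def agent_step(task: str, file_context: str) -> str:
    # Anything variable (task text, file contents, timestamps) goes in the user message only.
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Task: {task}\n\nRelevant files:\n{file_context}"},
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content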
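And for the hung-test-suite failure mode, a minimal hard-deadline wrapper around the test runner; the pytest invocation and the 300-second limit are assumptions, so adjust them to your suite or enforce the same limit in your MCP tool config.
# Hard deadline around the test runner so a hung suite can't stall the agent loop.
import subprocess

def run_tests(deadline_s: int = 300) -> bool:
    try:
        result = subprocess.run(
            ["pytest", "-x"],  # swap in your real test command
            capture_output=True, text=True, timeout=deadline_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        print(f"Test suite exceeded {deadline_s}s; treating as failure.")
        return False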
Variations and alternatives
Where this stack is wrong for your situation, the swap-in alternatives:
Apple Silicon variation. Replace vLLM + RTX 4090 with MLX-LM + M3 Max 64GB. The rest of the stack is unchanged. Throughput drops ~30-40% vs a 4090 but you trade GPU heat for a battery-powered laptop. See the Apple Silicon AI stack for the dedicated path.
Surgical-edits variation. If you want git-integrated CLI editing rather than autonomous task execution, swap OpenHands for Aider. Same model + runtime + MCP layer; different agent paradigm.
Higher-throughput agent-loop variation. Replace vLLM with SGLang if your stack does >10 tool calls per task on a stable system prompt. RadixAttention's tree-structured KV cache makes shared-prefix workloads ~1.3-1.7x faster. See the SGLang operational review for when this swap pays off.
Larger-codebase variation. For repos >1M tokens of context, swap Mem0 for Zep or Graphiti — temporal knowledge-graph memory holds long-horizon context better than flat vector retrieval.
How to verify the stack is healthy
The smoke tests we run on this stack:
- Throughput: curl -X POST http://localhost:8000/v1/completions ... with a 100-token prompt should sustain >30 tok/s on a 4090. If you're below 20 tok/s, vLLM picked the wrong kernels — check NCCL_DEBUG=INFO output and pin the Docker image rather than running a pip-installed build.
- TTFT: first-token latency for a cache-hit prefix should be <50ms. Repeat the same system prompt three times; the 2nd and 3rd should be much faster than the 1st. If they aren't, your prefix cache isn't hitting — see failure mode #5. A measurement sketch follows this list.
- End-to-end: ask the agent to fix a deliberately-broken test in a small repo. Should complete in <3 minutes. If it takes >5, the loop is wrong.
- Memory: close the agent, restart, ask “what did we change last session?” — Mem0 should surface the prior session's changes. If it doesn't, your memory provider isn't persisting.
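A minimal sketch for the TTFT smoke test: stream three requests with an identical prefix and time the first chunk of each. The short prompts here are placeholders; paste your real (long) agent system prompt so the cached prefix is large enough to make the speedup visible.
# TTFT check: three identical-prefix requests; runs 2 and 3 should be much faster than run 1.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="anything")
SYSTEM = "You are a coding agent."  # placeholder; use your real system prompt for a meaningful test

for run in range(1, 4):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": "Reply with OK."}],
        max_tokens=8,
        stream=True,
    )
    # Time the first streamed chunk, then drain the rest so the request completes cleanly.
    for i, chunk in enumerate(stream):
        if i == 0:
            print(f"run {run}: TTFT {(time.perf_counter() - start) * 1000:.0f} ms")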
Going deeper
The reference reading that backs every component pick in this stack:
- OpenHands catalog entry — agent architecture, Planning Mode details, MCP integration patterns.
- vLLM operational review — the runtime-specific operator detail (KV cache math, the gpu_memory_utilization knob, prefix cache invalidation).
- Mem0 catalog entry — memory-layer integration patterns and tradeoffs vs Letta / Zep / Graphiti.
- /systems/mcp — the protocol layer the filesystem and git servers run on.
- Inference runtime ecosystem map — where vLLM sits relative to the alternatives, with the full landscape categorized.