Build a local coding-agent stack (May 2026)
A coding agent that drafts diffs, runs tests, and edits files autonomously — entirely on your hardware, with persistent memory of the codebase.
- 01 · Tool · Coding agent (the planning + execution loop): openhands
OpenHands v1.6 ships Planning Mode (drafts a plan before execution) and has the longest production track record in the open-source category. Pick OpenHands over Aider when you want autonomous task execution; pick Aider for surgical git-integrated edits.
- 02 · Model · Coding model (the actual brain): qwen-2.5-coder-32b-instruct
Qwen 2.5 Coder 32B Instruct is the strongest open coding model in the 32B class as of May 2026 — beats DeepSeek Coder V2 Lite on HumanEval+ and SWE-Bench Lite at the same VRAM footprint. AWQ-INT4 fits on a 24GB card with headroom for a 32K context window.
- 03 · Tool · Inference engine (production-grade serving): vllm
vLLM over Ollama for this stack: continuous batching means an agent making 5-10 concurrent tool calls per task doesn't queue, prefix caching keeps the system prompt resident across iterations, and the OpenAI-compatible API plugs into OpenHands with zero adapter code. Use Ollama only for single-user laptop chat.
- 04 · Tool · File access (the agent's hands on the codebase): mcp-server-filesystem
The Anthropic reference filesystem MCP server with strict directory allowlisting. Required for OpenHands to read and write project files; allowlist limits blast radius when the agent goes off the rails.
- 05 · Tool · Repository state (status, diff, blame, history): mcp-server-git
Pairs with mcp-server-filesystem to give the agent full repo awareness — read-side operations only by default. Lets OpenHands reason about what changed and why before proposing new edits.
- 06 · Tool · Persistent memory (codebase context across sessions): mem0
Mem0 over Letta or Zep for this stack: dropping a memory layer into OpenHands takes 20 lines of config; Letta's OS-style explicit memory management is overkill for a single-user coding agent; Zep's temporal knowledge graph is strong but slower to wire. A minimal usage sketch follows this component list.
- 07 · Hardware · GPU (where the model runs): rtx-4090
RTX 4090 24GB is the sweet spot for this stack: enough VRAM for Qwen 32B AWQ-INT4 + 32K context, enough memory bandwidth (1 TB/s) for sub-second TTFT, and consumer-grade thermals. The 5090 helps but isn't required; the 4080 16GB doesn't have headroom for the context window the agent actually needs.
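What the memory layer looks like in practice: a minimal standalone sketch of the Mem0 Python API (pip install mem0ai), separate from the OpenHands wiring shown in step 3 below. The default Memory() config, the repo name, and the stored fact are illustrative assumptions; a fully local setup supplies its own LLM, embedder, and vector store.
# Standalone sketch of the Mem0 memory layer; OpenHands wires this in through
# the [memory] section of config.toml instead (step 3 below).
from mem0 import Memory

memory = Memory()  # default config; point it at local models for a fully offline setup

# Record something the agent learned this session (repo name and fact are illustrative)
memory.add(
    "auth_token expiry validation lives in src/auth/tokens.py",
    user_id="myrepo",
)

# Next session: retrieve relevant context before planning
print(memory.search("where is token expiry validated?", user_id="myrepo"))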
Step-by-step setup
The four steps that take this stack from zero to a working agent. Run them in order on a Linux box with CUDA 12.x already installed; on macOS, swap the GPU step for MLX-LM (see the Apple Silicon variation below).
1. Bring up vLLM with the coding model
# Pull the AWQ-INT4 quant — fits a 24GB card with 32K context
docker run --gpus all --rm -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.17.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--enable-chunked-prefill
The --enable-chunked-prefill flag is non-optional — long-context prefills (the agent will routinely read 1000+ line files) will otherwise stall every other request for 1-3 seconds. gpu-memory-utilization=0.9 leaves ~2GB of VRAM headroom; lower it to 0.85 if you OOM on first inference.
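Before wiring OpenHands, confirm the endpoint actually answers. A minimal sketch using the openai Python client (pip install openai) against vLLM's OpenAI-compatible API; the model name must match whatever --model you passed above.
# Sanity check: the vLLM OpenAI-compatible endpoint answers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="anything")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # must match the --model flag above
    messages=[{"role": "user", "content": "Write a one-line Python hello world."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)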
2. Install the MCP servers
# Filesystem — strict allowlist limits agent blast radius
npx -y @modelcontextprotocol/server-filesystem ~/projects/myrepo
# Git — read-side repo metadata
npx -y @modelcontextprotocol/server-git --repository ~/projects/myrepo
Both run as stdio MCP servers — OpenHands launches them on demand and tears them down between sessions. Pin the allowlisted directory to one repo at a time; never point the filesystem MCP server at ~/ or your blast radius is your entire home directory.
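If you want to verify the filesystem server starts and exposes its tools before handing it to the agent, here is an optional standalone sketch using the MCP Python SDK (pip install mcp); OpenHands performs this stdio handshake itself at session start. The repo path is the same placeholder used above, and the expected tool names are from the reference server.
# Optional: confirm the filesystem MCP server starts and lists its tools.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "/home/you/projects/myrepo"],
)

async def main():
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Expect names like read_file, write_file, list_directory
            print([tool.name for tool in tools.tools])

asyncio.run(main())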
3. Wire OpenHands to the stack
# config.toml
[llm]
model = "openai/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ"
api_base = "http://localhost:8000/v1"
api_key = "anything" # vLLM doesn't check it
[mcp]
servers = [
{ command = "npx", args = ["-y", "@modelcontextprotocol/server-filesystem", "/home/you/projects/myrepo"] },
{ command = "npx", args = ["-y", "@modelcontextprotocol/server-git", "--repository", "/home/you/projects/myrepo"] }
]
[memory]
provider = "mem0"
config = { api_key = "local", host = "http://localhost:11434" }
4. Run a real task
# Drop OpenHands into Planning Mode for the first run
openhands run --plan-first \
--task "Find the bug causing the auth_token validation to fail \
on expired tokens and write a regression test"
The agent should: read the relevant files via filesystem MCP, examine recent commits via git MCP, draft a plan (with Planning Mode this is shown to you for approval), then make the edit and run the test suite. End-to-end on a real bugfix: 60-180 seconds. If your first run takes 10+ minutes, you have a configuration problem — see Failure Modes below.
Failure modes you'll hit
The list of things that go wrong with this stack, in rough order of how often we've seen them:
- vLLM OOM on first inference (not on load). The model loaded fine but the first request crashes. Lower --gpu-memory-utilization from 0.9 to 0.85, or drop --max-model-len from 32768 to 16384.
- Agent loops on plan revision. OpenHands keeps re-planning instead of executing — usually means the model isn't getting a clear enough “ok, plan approved, execute” signal. With Planning Mode, this is fixed by explicitly approving the plan in the UI; in headless mode, set plan_first = false after the first session.
- Filesystem MCP path-escape attempt. The allowlist is enforced; the symptom is the agent reporting “permission denied” on files outside your repo. That's correct behaviour. If you need a wider scope, widen the allowlist deliberately rather than disabling it.
- Mem0 retrieves stale codebase context. The memory layer learned the codebase as it was 3 weeks ago; the agent now reasons against stale knowledge. Re-ingest after major refactors with mem0 reindex --workspace myrepo.
- vLLM prefix cache invalidation on every request. If your TTFT is 200-500ms instead of <50ms after the first call, your system prompt is templating variable user data. Move the variable parts to the user message; the system prompt should be byte-identical across the agent loop (see the sketch after this list).
- Test suite hangs the agent indefinitely. Long-running tests (integration suites that boot a database) blow past OpenHands' default tool-call timeout silently. Set per-tool timeouts in the MCP config or wrap your test runner in a hard deadline (a minimal wrapper follows the prefix-cache sketch below).
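To make the prefix-cache failure mode concrete, here is a minimal sketch of cache-friendly prompting against the vLLM endpoint: the system prompt stays byte-identical and everything variable lives in the user message. The prompt text and the agent_step helper are illustrative, not OpenHands internals.
# Prefix-cache-friendly prompting: the system prompt is byte-identical on every
# call, so vLLM reuses its KV cache across iterations of the agent loop.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="anything")

SYSTEM = "You are a coding agent. Follow the approved plan, edit files, run tests."  # never templated

def agent_step(task: str, file_context: str) -> str:
    # Anything variable (task text, file contents, timestamps) goes in the user message only.
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Task: {task}\n\nRelevant files:\n{file_context}"},
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content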
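And for the hung-test-suite failure mode, a minimal hard-deadline wrapper around the test runner; the pytest invocation and the 300-second limit are assumptions, so adjust them to your suite or enforce the same limit in your MCP tool config.
# Hard deadline around the test runner so a hung suite can't stall the agent loop.
import subprocess

def run_tests(deadline_s: int = 300) -> bool:
    try:
        result = subprocess.run(
            ["pytest", "-x"],  # swap in your real test command
            capture_output=True, text=True, timeout=deadline_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        print(f"Test suite exceeded {deadline_s}s; treating as failure.")
        return False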
Variations and alternatives
Where this stack is wrong for your situation, the swap-in alternatives:
Apple Silicon variation. Replace vLLM + RTX 4090 with MLX-LM + M3 Max 64GB. The rest of the stack is unchanged. Throughput drops ~30-40% vs a 4090 but you trade GPU heat for a battery-powered laptop. See the Apple Silicon AI stack for the dedicated path.
Surgical-edits variation. If you want git-integrated CLI editing rather than autonomous task execution, swap OpenHands for Aider. Same model + runtime + MCP layer; different agent paradigm.
Higher-throughput agent-loop variation. Replace vLLM with SGLang if your stack does >10 tool calls per task on a stable system prompt. RadixAttention's tree-structured KV cache makes shared-prefix workloads ~1.3-1.7x faster. See the SGLang operational review for when this swap pays off.
Larger-codebase variation. For repos >1M tokens of context, swap Mem0 for Zep or Graphiti — temporal knowledge-graph memory holds long-horizon context better than flat vector retrieval.
How to verify the stack is healthy
The smoke tests we run on this stack:
- Throughput: curl -X POST http://localhost:8000/v1/completions ... with a 100-token prompt should sustain >30 tok/s on a 4090. If you're below 20 tok/s, vLLM picked the wrong kernels — check NCCL_DEBUG=INFO output and pin the Docker image rather than running a pip-installed build.
- TTFT: first-token latency for a cache-hit prefix should be <50ms. Repeat the same system prompt three times; the 2nd and 3rd should be much faster than the 1st. If they aren't, your prefix cache isn't hitting — see failure mode #5. A measurement sketch follows this list.
- End-to-end: ask the agent to fix a deliberately-broken test in a small repo. Should complete in <3 minutes. If it takes >5, the loop is wrong.
- Memory: close the agent, restart, ask “what did we change last session?” — Mem0 should surface the prior session's changes. If it doesn't, your memory provider isn't persisting.
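A minimal sketch for the TTFT smoke test: stream three requests with an identical prefix and time the first chunk of each. The short prompts here are placeholders; paste your real (long) agent system prompt so the cached prefix is large enough to make the speedup visible.
# TTFT check: three identical-prefix requests; runs 2 and 3 should be much faster than run 1.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="anything")
SYSTEM = "You are a coding agent."  # placeholder; use your real system prompt for a meaningful test

for run in range(1, 4):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": "Reply with OK."}],
        max_tokens=8,
        stream=True,
    )
    # Time the first streamed chunk, then drain the rest so the request completes cleanly.
    for i, chunk in enumerate(stream):
        if i == 0:
            print(f"run {run}: TTFT {(time.perf_counter() - start) * 1000:.0f} ms")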
Going deeper
The reference reading that backs every component pick in this stack:
- OpenHands catalog entry — agent architecture, Planning Mode details, MCP integration patterns.
- vLLM operational review — the runtime-specific operator detail (KV cache math, the gpu_memory_utilization knob, prefix cache invalidation).
- Mem0 catalog entry — memory-layer integration patterns and tradeoffs vs Letta / Zep / Graphiti.
- /systems/mcp — the protocol layer the filesystem and git servers run on.
- Inference runtime ecosystem map — where vLLM sits relative to the alternatives, with the full landscape categorized.