Capability notes
Coding agents in 2026 — [Aider](/tools/aider), [Cline](/tools/cline), and OpenHands — operate in a loop: read repo context → plan edit → apply diff → run tests → evaluate results → iterate. SWE-bench Verified scores: Aider + Claude 3.7 Sonnet = 48.5%, OpenHands + DeepSeek V4 = 51.2% — autonomously resolving roughly half of real-world GitHub issues requiring multi-file edits. This is up from 15–25% in early 2024, driven by frontier model improvements and better agent architectures.
Agentic coding differs from code generation in scope: a completion tool suggests one function; an agent modifies 3–15 files, writes tests, runs them, and iterates on failures. The agent maintains state across tool calls — reading files, executing commands, parsing errors, applying fixes.
Model quality drives agent performance more than agent architecture. The same framework with [DeepSeek V4](/models/deepseek-v4) scores 2–3× higher on SWE-bench than with [Llama 3.3 70B](/models/llama-3-3-70b). Frontier MoE models handle multi-step reasoning, tool-use sequencing, and error recovery better. Serious agentic coding requires 70B-class or frontier MoE. 32B models handle single-file edits and simple test-fix loops. 7B models cannot reliably complete agentic workflows — they lose context after 3–5 tool calls.
The architecture that works: architect mode (planning model) + edit mode (execution model). The architect plans the multi-file change; the editor applies specific diffs. This separation reduces context pollution — the architect reasons about the full repo while the editor works on one file at a time. [Aider](/tools/aider) implements this with `--architect`; [Cline](/tools/cline) via "Plan" vs "Act" mode.
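A minimal sketch of this split with Aider, pairing the cloud and local models used elsewhere on this page (the pairing is illustrative, not prescriptive):

```bash
# Architect plans against the full repo; a local editor applies each diff.
export OPENROUTER_API_KEY="sk-or-v1-..."
aider --architect \
      --model openrouter/deepseek/deepseek-chat-v3 \
      --editor-model ollama_chat/llama3.3:70b-instruct-q4_K_M
```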
If you just want to try this
Lowest-friction path to a working setup.
Start with [Aider](/tools/aider) using [DeepSeek V3](/models/deepseek-v3) via OpenRouter — zero hardware setup:
```bash
pip install aider-chat
export OPENROUTER_API_KEY="sk-or-v1-..."
aider --model openrouter/deepseek/deepseek-chat-v3
```
This gives you a terminal-based agent that reads your repo, makes multi-file edits, stages git commits, and iterates on test failures. [DeepSeek V3](/models/deepseek-v3) via OpenRouter costs ~$0.89/M input tokens and ~$1.10/M output — a typical hour-long session costs $3–8.
Once you're comfortable, move to local inference for zero per-token cost. Install [Ollama](/tools/ollama):
```bash
ollama pull llama3.3:70b-instruct-q4_K_M
aider --model ollama_chat/llama3.3:70b-instruct-q4_K_M
```
Hardware: 24 GB+ VRAM for 70B Q4 (with partial CPU offload at 24 GB): [RTX 4090](/hardware/rtx-4090), [RTX 5090](/hardware/rtx-5090), or [RTX 3090](/hardware/rtx-3090). On a [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) with 64 GB+, the model fits entirely in unified memory.
For [Cline](/tools/cline), install the VS Code extension, set the API provider to "Ollama" at `http://localhost:11434`, and select [Llama 3.3 70B](/models/llama-3-3-70b) or [DeepSeek V4](/models/deepseek-v4). Cline's VS Code integration provides the inline diff previews that Aider's terminal workflow lacks.
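Before pointing Cline at that endpoint, it's worth a quick check that Ollama is actually serving the model (default port assumed):

```bash
# List model tags the local Ollama server exposes; the 70B tag pulled above
# must appear here before Cline can select it.
curl -s http://localhost:11434/api/tags | grep -o '"name":"[^"]*"'
```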
Start with small tasks: "fix typo in README," "add unit test for parse_config," "refactor this 80-line function into two helpers." Graduate to multi-file refactors once you know when the agent succeeds vs needs human guidance. Rule of thumb: if the change can be described in 2–3 sentences with clear file paths and function names, the agent can do it. If design decisions span ambiguous requirements, supervise manually.
For production deployment
Operator-grade recommendation.
Production agentic pipelines combine an orchestrator, model backend, sandboxed execution, and monitoring.
**Planning model (architect):** frontier-tier reasoning ([DeepSeek V4](/models/deepseek-v4), Claude 3.7 Sonnet via API). Reads issue description, explores repo, selects relevant files, drafts multi-file edit plan. A single plan invocation costs 20K–50K input + 2K–5K output tokens. On [DeepSeek V4](/models/deepseek-v4) via [vLLM](/tools/vllm) on [H100 PCIe](/hardware/nvidia-h100-pcie): 15–30 seconds.
**Editing model (executor):** 70B-class ([Llama 3.3 70B](/models/llama-3-3-70b), [Qwen 3 32B](/models/qwen-3-32b)). Applies individual file diffs from the plan, receiving file content + specific edit instruction. Each diff costs 3K–10K tokens. Local execution avoids per-token API charges on the high-volume editing step.
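One way to wire the two tiers together, sketched with vLLM for the architect endpoint and Ollama for the editor; the model ID, port, and context flag are placeholders to adapt to your hardware:

```bash
# Architect endpoint: vLLM serving the planning model on an OpenAI-compatible
# API (model ID and sizing flags are placeholders).
vllm serve deepseek-ai/DeepSeek-V3 --port 8000 --max-model-len 65536 &

# Editor endpoint: Ollama serving the high-volume diff model locally.
ollama pull llama3.3:70b-instruct-q4_K_M

# Wire both into one Aider session; vLLM accepts any key unless --api-key is set.
export OPENAI_API_KEY=local-dummy
aider --architect \
      --model openai/deepseek-ai/DeepSeek-V3 \
      --openai-api-base http://localhost:8000/v1 \
      --editor-model ollama_chat/llama3.3:70b-instruct-q4_K_M
```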
**Sandbox:** every agent-generated command executes in an isolated container (Docker/Podman) with restricted network egress and a filesystem limited to a repo clone. [Aider](/tools/aider)'s `--no-auto-commits` + `--yes` with a Docker wrapper provide basic sandboxing. [Cline](/tools/cline)'s "require approval" mode flags dangerous commands (`rm -rf`, `git push --force`, `curl` to unknown hosts, anything touching `/etc` or `/home` outside the project directory).
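A basic wrapper in that spirit, assuming a prebuilt image with Aider installed (the image name is hypothetical):

```bash
# Isolated agent run: read-only rootfs, scratch space in tmpfs, and only the
# repo clone writable. In practice you would attach a user-defined network
# that permits egress to the model endpoint alone; "agent-sandbox:latest" is
# a hypothetical image with aider preinstalled.
docker run --rm -it \
  --network none \
  --read-only --tmpfs /tmp \
  -v "$PWD":/repo -w /repo \
  agent-sandbox:latest \
  aider --no-auto-commits --yes "fix the failing parser tests"
```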
**Test-fix loop budget:** each iteration costs 30–90 seconds with local inference. Cap at 5 iterations per issue. Beyond 5 without passing tests, escalate to human. Track "fix ratio" — % of iterations that improve test results. Below 40% means model is thrashing — terminate and restart with revised plan.
**When agents break.** Agents fail when: (1) the issue requires understanding undocumented architecture decisions, (2) the fix touches 10+ files with interdependencies (context window exceeds 128K), (3) the test suite takes 60+ seconds, or (4) the agent encounters a novel error never seen in training. Implement circuit-breaker: 3 consecutive no-improvement iterations → halt. Total runtime exceeds 15 minutes → halt.
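A sketch of enforcing those budgets, assuming a placeholder `agent-iteration` command for one plan → edit cycle and a pytest-style summary line:

```bash
#!/usr/bin/env bash
# Circuit breaker for the test-fix loop: cap at 5 iterations, halt after
# 3 consecutive non-improving iterations or 15 minutes of wall-clock time.
# "agent-iteration" is a placeholder for one plan -> edit cycle.
MAX_ITER=5 STALL_LIMIT=3 DEADLINE=$((SECONDS + 900))
prev=1000000 stalled=0
for i in $(seq 1 "$MAX_ITER"); do
  (( SECONDS > DEADLINE )) && { echo "15-minute budget exceeded" >&2; exit 1; }
  agent-iteration
  if pytest -q > /tmp/agent-test.log 2>&1; then
    echo "tests pass after $i iterations"; exit 0
  fi
  # Parse the failure count from the pytest summary, e.g. "3 failed, 12 passed".
  fails=$(tail -n1 /tmp/agent-test.log | grep -oE '[0-9]+ failed' | grep -oE '[0-9]+')
  fails=${fails:-$prev}    # unparseable output counts as no improvement
  if (( fails < prev )); then stalled=0; else (( stalled += 1 )); fi
  (( stalled >= STALL_LIMIT )) && { echo "3 non-improving iterations, halting" >&2; exit 1; }
  prev=$fails
done
echo "iteration cap reached, escalating to human" >&2; exit 1
```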
**Cost.** Local inference on [H100 PCIe](/hardware/nvidia-h100-pcie) at ~$2.50/hr processes ~200–400 iterations/hour. API-based agents cost $0.15–0.75/iteration. For 50 issues/day at 4 iterations each = 200 iterations/day, local saves $28–148/day vs API — a $10,000–54,000/year differential.
What breaks
Failure modes operators see in the wild.
**Agent loop divergence.** Symptom: agent repeatedly edits same files, never converging — codebase worse than baseline after 5+ iterations. Cause: model lacks global understanding of side effects — fixes A, breaks B. Mitigation: cap at 5 iterations, require new planning step before each iteration above 3, alert when any file touched more than twice.
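The file-touch alert can be a one-liner against the agent's branch (branch name hypothetical):

```bash
# Alert when any file is modified in more than two of the agent's commits.
# "agent/fix-1234" is a hypothetical agent branch; main is the baseline.
git log main..agent/fix-1234 --name-only --pretty=format: \
  | grep -v '^$' | sort | uniq -c | awk '$1 > 2 {print "ALERT:", $2}'
```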
**Infinite repair cycles.** Symptom: "apply edit → tests fail → apply same fix → same failure → repeat." Cause: error-recovery reasoning generates the same fix because it cannot identify root cause from test output alone. Mitigation: deduplication check — if the same diff is proposed twice, stop and inject the actual error message with directive to "explain root cause before generating fix."
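The deduplication check can be as simple as hashing each proposed diff before applying it (file names illustrative):

```bash
# Halt if the agent proposes a byte-identical diff it has already tried.
# "proposed.diff" and ".agent-seen-diffs" are illustrative file names.
hash=$(sha256sum proposed.diff | cut -d' ' -f1)
if grep -qx "$hash" .agent-seen-diffs 2>/dev/null; then
  echo "duplicate fix proposed; ask model for root cause first" >&2
  exit 1
fi
echo "$hash" >> .agent-seen-diffs
git apply proposed.diff
```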
**Context pollution.** Symptom: after 10+ tool calls, agent hallucinates file contents, references nonexistent variables, proposes edits to wrong files. Cause: full conversation history including tool outputs accumulates — by iteration 8 at 128K context, 70% is tool output history. Mitigation: architect-editor pattern — architect gets full repo context; editor gets only current file + specific edit instruction. Archive tool outputs after each iteration; summarize rather than retaining raw stdout.
**Git state corruption.** Symptom: agent commits broken changes, pushes to main directly, force-pushes, or creates merge conflicts. Cause: agent tool-use permissions grant git commands without guardrails. Mitigation: never grant `git push`. Restrict `git commit` to a dedicated agent branch. Require a `.agent-guardrails` file blocking force-push, `checkout main`, and hard resets. Validate git state before and after each invocation.
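One concrete guardrail, sketched as a `pre-push` hook that rejects any push from an agent session:

```bash
#!/bin/sh
# .git/hooks/pre-push: refuse every push made during an agent session.
# AGENT_RUN is a hypothetical env var the orchestrator sets for the run.
if [ -n "$AGENT_RUN" ]; then
  echo "push blocked: agent sessions may not push" >&2
  exit 1
fi
```

A matching `pre-commit` hook can refuse commits unless `git symbolic-ref --short HEAD` reports an `agent/*` branch.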
**Unsafe command execution.** Symptom: agent runs `rm -rf`, `chmod -R 777`, unchecked SQL migrations, or downloads remote scripts. Cause: model treats all shell commands as equally valid, with no risk model. Mitigation: run in a container with read-only rootfs except the project clone, block network egress during agent runs, and maintain a blocklist (`rm -rf`, `chmod -R`, `git push --force`, `sudo`, `pip install`, `curl | bash`) requiring explicit human approval.
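A sketch of the blocklist gate, run between the model proposing a command and the sandbox executing it (patterns abbreviated):

```bash
# Gate each proposed shell command against a destructive-pattern blocklist;
# anything matched is held for human approval.
blocklist='rm -rf|chmod -R|git push --force|sudo |pip install|curl .*\| *(ba)?sh'
cmd="$1"
if printf '%s' "$cmd" | grep -qE "$blocklist"; then
  echo "BLOCKED: '$cmd' requires explicit human approval" >&2
  exit 1
fi
exec sh -c "$cmd"   # passed the gate: execute inside the sandbox
```

String matching is a tripwire, not a security boundary; the read-only container remains the actual enforcement layer.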
**Test blindness.** Symptom: agent claims "all tests pass" but invocation was a no-op (wrong runner, wrong directory, tests skipped). Cause: model conflates "test exit code 0" with "tests actually ran." Mitigation: require agent to report exact command, number of tests run, and duration. Parse output — "0 tests ran" = failure.
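A sketch of that check for a pytest-style runner (summary formats vary, so the parsing is illustrative):

```bash
# Treat "exit 0 but zero tests executed" as a failure, not a pass.
out=$(pytest -q 2>&1); status=$?
ran=$(printf '%s\n' "$out" | tail -n1 \
      | grep -oE '[0-9]+ (passed|failed)' | grep -oE '[0-9]+' \
      | paste -sd+ - | bc)
if [ "${ran:-0}" -eq 0 ]; then
  echo "test blindness: exit $status but 0 tests ran" >&2
  exit 1
fi
```

(pytest does exit nonzero when it collects zero tests, but an explicit count also catches deselected or skipped-suite runs.)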
Hardware guidance
**Hobbyist tier ($1,500–2,500 GPU).** Agents are the most demanding local AI workload. [RTX 4090](/hardware/rtx-4090) at 24 GB runs 70B Q4 with 16K–32K context — the minimum viable consumer card for agentic work. Expect 22–28 tok/s, iterations of 45–90 seconds. [RTX 5090](/hardware/rtx-5090) at 32 GB: 40–55 tok/s at 32K context — iterations drop to 25–50 seconds. Dual [RTX 3090](/hardware/rtx-3090) (48 GB): 25–35 tok/s at 64K context — the used-market sweet spot. [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 128 GB: 20–30 tok/s with 64K+ context — the laptop pick.
**SMB tier ($6,000–15,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) at 48 GB with 960 GB/s: 80–100 tok/s — iterations drop to 15–25 seconds. Makes agentic coding feel interactive. [L40S](/hardware/nvidia-l40s) at 48 GB: similar datacenter performance. Single card handles 5–8 concurrent agent instances via vLLM continuous batching.
**Enterprise tier ($25,000+).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB with 2.0 TB/s: 140–170 tok/s — iterations drop to 8–15 seconds. 80 GB fits 70B FP8 with 64K context plus KV cache for concurrent sessions. [H200](/hardware/nvidia-h200) at 141 GB: fits [DeepSeek V4](/models/deepseek-v4) at FP8 with 128K context — 50+ concurrent agent instances. [AMD MI300X](/hardware/amd-mi300x) at 192 GB: fits DeepSeek V4 at FP16 with 128K context and 100+ concurrent sessions.
**Context window scaling.** Each 1K tokens of context consumes ~0.7–1.2 GB of KV cache for 70B models. 24 GB GPU running 70B Q4 (~40 GB) with partial offload: ~8 GB for KV = ~8K context — borderline. 48 GB GPU running 70B FP8 (~35 GB): ~13 GB for KV = ~16K context — adequate. 80 GB GPU: ~45 GB for KV = ~64K context — comfortable. Choose hardware by the context window needed: 16K minimum for basic agentic work, 32K for multi-file refactors, 64K for repository-scale changes.
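The arithmetic behind those figures, using a round ~1 GB per 1K tokens planning number (a sketch; the true per-token cost depends on layer count, KV-head sharing, and cache precision):

```bash
# Max-context estimate: VRAM left over after weights, at ~1 GB per 1K tokens.
vram_gb=48; weights_gb=35; gb_per_1k_tokens=1
kv_gb=$(( vram_gb - weights_gb ))                  # 13 GB left for KV cache
echo "max context ~ $(( kv_gb * 1000 / gb_per_1k_tokens )) tokens"   # ~13K; FP8 KV stretches this toward 16K
```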
Runtime guidance
**Aider vs Cline vs OpenHands — agent architecture determines model compatibility.**
[Aider](/tools/aider) is a terminal-based agent: repo-map → prompt assembly → model response → search/replace block extraction → file edit → git commit → test run → iterate. Aider uses search/replace blocks as its edit format, making it the most model-agnostic agent: it works with 70B local models, frontier cloud APIs, and everything in between. Architect mode (`--architect`) splits planning and editing across two models. Tradeoffs: terminal-only with no GUI diff preview (rely on `git diff`), and repo-map generation adds latency.
[Cline](/tools/cline) is a VS Code extension that acts as an autonomous agent with file read/write, terminal execution, browser access, and MCP tool integration. Architecture: user prompt → plan → execute tool → observe → replan → repeat. It works with any OpenAI-compatible API, including local [Ollama](/tools/ollama) and [vLLM](/tools/vllm). Advantage: deep VS Code integration with inline diffs, per-hunk accept/reject, and real-time observation. Tradeoff: the tool-use event loop consumes 2–4× more context per iteration than Aider's structured edit format.
**OpenHands** (formerly OpenDevin) is a web-based platform running in a sandboxed Docker container with full shell, filesystem, and browser access. CodeAct architecture: task description → write code → execute → iterate. It achieves the highest SWE-bench Verified score among open-weight agent frameworks (51.2% with [DeepSeek V4](/models/deepseek-v4)). Tradeoffs: web-UI deployment complexity, Docker sandbox infrastructure to maintain, and local model support that depends on vLLM integration.
**Architect vs unified mode.** Architect mode: planning model receives full repo context (file tree + docstrings via repo-map) and produces change plan; editing model receives individual files + specific instructions. Reduces context consumption 40–60%. Essential for local models with limited context. Unified mode: single model plans and edits — simpler but requires larger context and stronger models. Works with [DeepSeek V4](/models/deepseek-v4); degrades with [Llama 3.3 70B](/models/llama-3-3-70b) after 3–4 iterations.
**Claude API vs local.** Claude 3.7 Sonnet scores 60–80% higher on SWE-bench than the best open-weight 70B, recovers from tool-use errors 3× more reliably, and handles 200K context natively. Tradeoff: $3/M input, $15/M output, and your code goes to Anthropic.