Local AI for developers
Coding agents that talk to a local model: Continue.dev, Cline, Aider, and Cursor with a local backend. What each is for, the OpenAI-compatible API plumbing they all share, and what local code models can and can't do in 2026.
Answer first
In 2026 a 14-32B local model running through Ollama with the right coding agent on top — Continue.dev for in-editor work, Cline for agentic file edits, Aider for CLI repo work, or Cursor pointed at a local backend — gives you most of the daily-driver coding-assistant experience without a per-seat cloud bill and without your code leaving the machine. Frontier cloud agents still beat this on the hardest reasoning tasks (large unfamiliar codebases, multi-file architectural refactors), but they don't beat it on the tab-completion-and-tight-edits work that fills most of the day.
The whole stack is interoperable because all four agents speak the OpenAI Chat Completions API, which Ollama, LM Studio, vLLM, and SGLang all expose. You change the base URL, you change the model name, the agent doesn't care.
The OpenAI-compatible API plumbing
The single insight that makes local-AI-for-developers click: every modern local-runtime implementation exposes an HTTP server compatible with the OpenAI Chat Completions and Embeddings APIs on a localhost port. Once you understand that, the whole ecosystem is interchangeable.
- Ollama: http://localhost:11434/v1. Default for most desktop setups.
- LM Studio: http://localhost:1234/v1. GUI with a built-in server toggle.
- vLLM: http://localhost:8000/v1. Production-grade, handles concurrent requests, used for homelab and small-team serving.
- SGLang: http://localhost:30000/v1. Faster on multi-GPU concurrent serving than vLLM in some configurations.
Any tool that accepts an “OpenAI base URL” setting — Continue.dev, Cline, Aider, Cursor, your own scripts — can point at any of these. The model is local, the API surface is the one you already know. Our internal API documentation is at /api-docs.
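To make the interchangeability concrete, here's a minimal sketch using the official openai Python package pointed at the Ollama port. The model names are illustrative assumptions; substitute whatever your runtime actually serves.

```python
# Minimal sketch of the base-URL swap, using the official `openai` client
# (pip install openai). Model names below are assumptions for illustration.
from openai import OpenAI

# Same client class, same call shape: only the base URL and model name change.
BACKENDS = {
    "ollama":    ("http://localhost:11434/v1", "qwen2.5-coder:14b"),
    "lm_studio": ("http://localhost:1234/v1",  "qwen2.5-coder-14b-instruct"),
    "vllm":      ("http://localhost:8000/v1",  "Qwen/Qwen2.5-Coder-14B-Instruct"),
}

base_url, model = BACKENDS["ollama"]
# Local servers ignore the API key, but the client requires a non-empty string.
client = OpenAI(base_url=base_url, api_key="not-needed-locally")

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Write a one-liner that reverses a string."}],
)
print(resp.choices[0].message.content)
```

Swapping `"ollama"` for `"vllm"` in the lookup is the entire migration; the rest of the call is untouched.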
The four coding agents that work with local
1. Continue.dev — VS Code (and JetBrains) extension. The most polished “feels like Copilot but local” experience. Inline completions, side-panel chat, edit suggestions on highlighted code. A single config file points at your Ollama or LM Studio endpoint and assigns which model handles which role (autocomplete vs chat vs edit). Best for: developers who want a Copilot-style daily driver without sending code to GitHub.
2. Cline — VS Code extension for agentic edits. Goes beyond completions: you describe a multi-file change in natural language and Cline plans, edits, and shows diffs across files for your approval. It works with any OpenAI-compatible endpoint, which means it can run against the same Ollama backend you use for Continue. Best for: tasks that span 3-10 files where the alternative is doing it all by hand. The agentic loop burns more tokens than plain chat, so plan model size and context window accordingly.
3. Aider — CLI-first repo agent. Lives in a terminal, knows about git, and makes commits with diffs you review. The interaction model is “tell Aider what you want, watch it propose changes, accept or revise, repeat.” Excellent for repos where you want the AI's changes reviewable in git history, not just inline in your editor. Best for: backend developers, infra-as-code work, anyone who lives in a terminal more than an IDE. A minimal launch sketch follows this list.
4. Cursor with a local backend. Cursor is a fork of VS Code with deep AI integration. It defaults to its own hosted models, but you can point it at any OpenAI-compatible endpoint via the model settings, including a local Ollama. The local-backend experience is rougher than the hosted default — some Cursor features (background indexing, certain kinds of long-context retrieval) work better against hosted endpoints — but if Cursor is already your editor and you want to keep code local, it's a real option.
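Because Aider is CLI-first, pointing it at a local endpoint is a matter of an environment variable and flags rather than an editor config. The sketch below (referenced in the Aider entry above) shows one way to launch it from a script; the OLLAMA_API_BASE variable and the litellm-style `ollama/` model prefix follow Aider's documented Ollama setup as I understand it, so verify against `aider --help` for your installed version.

```python
# Launching aider against a local Ollama endpoint from a script.
# The model string and env var follow aider's Ollama docs as I understand
# them; treat both as assumptions and check your installed version.
import os
import subprocess

env = {**os.environ, "OLLAMA_API_BASE": "http://localhost:11434"}
subprocess.run(
    [
        "aider",
        "--model", "ollama/qwen2.5-coder:14b",  # assumption: your local model tag
        "--no-auto-commits",                    # review diffs before committing
    ],
    env=env,
    check=True,
)
```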
The full coding-agents map with feature comparisons is at /maps/coding-agents-2026.
What local code models can and can't do
Operator-grade honesty about the 2026 capability ceiling. The relevant models are Qwen 2.5 Coder 14B/32B, DeepSeek Coder V2, and CodeLlama. The ranges below are what experienced operators report, not best-case marketing.
What local does well. Tab completion in a familiar codebase (the model has the surrounding code as context and doesn't need to invent anything). Single-file refactors and small bug fixes. Boilerplate generation when the structure is well-defined. Test scaffolding. Renaming and migration tasks. Documentation and README writing. These cover roughly 80% of daily-driver coding work.
What local does adequately. Multi-file edits within a small repo (3-10 files, <5K lines) when guided by Cline or Aider. Code review of small-to-medium pull requests. Translating short scripts from one language to another. Explaining unfamiliar code. SQL drafting against a known schema.
What local genuinely struggles with. Large unfamiliar codebases (100K+ lines) where the model needs to hold a lot of architecture in context. Novel algorithm design where the model needs to reason across multiple unfamiliar concepts. Long-context analysis where a frontier 200K-context model has a real edge. The newest framework idioms in fast-moving ecosystems where the model's training data is six months stale.
Latency budget — what makes inline completion feel right
Coding completion is a latency game, not a throughput game. The honest target for an inline tab-complete experience is 30+ decode tok/s on the autocomplete model and under 400 ms time-to-first-token on a 1-2K-token prefill. Below those numbers, the suggestion arrives after you've already written the line yourself, and the feature stops earning its keep.
On a 24 GB consumer GPU running Qwen 2.5 Coder 14B at Q4 in vLLM or Ollama, single-stream decode lands in the 60-90 tok/s band and TTFT in the low hundreds of milliseconds — comfortably above the felt-right floor. On a 16 GB GPU running 7B-class models the same numbers are 80-120 tok/s decode and even lower TTFT, the trade-off being a noticeable drop in code quality on harder tasks. On Apple Silicon M3/M4 Max running MLX, decode is 35-55 tok/s on 14B Q4 and TTFT is fine for inline use, with the unified-memory advantage that 32B suddenly becomes practical for chat-and-edit even when it would OOM a 24 GB discrete card. Cross-check whatever number you hit against the catalog at /benchmarks; the broader engine-choice framing for why Ollama vs vLLM vs llama.cpp matters for this workload is in the engine choice matrix.
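If you want to check your own setup against those numbers, a rough measurement harness is sketched below, assuming an Ollama endpoint and the openai package. Stream-chunk counting is only a proxy for true token counts, and the model name is an assumption.

```python
# Rough TTFT and decode-rate check against a local OpenAI-compatible endpoint.
# Targets from the text: under ~400 ms TTFT, 30+ tok/s decode.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

# Stand-in prompt; real inline completion carries 1-2K tokens of code context.
prompt = "Implement a quicksort in Python."

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen2.5-coder:14b",  # assumption: use whatever your runtime serves
    messages=[{"role": "user", "content": prompt}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

end = time.perf_counter()
if first_token_at is None:
    raise RuntimeError("no tokens received from the endpoint")

print(f"TTFT:   {(first_token_at - start) * 1000:.0f} ms")
print(f"decode: {chunks / (end - first_token_at):.1f} tok/s (chunk count ~ token count)")
```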
The recommended stack for most developers
Synthesizing operator reports from across 2025-2026, the stack that lands well for most developers looks like this:
- Runtime: Ollama on Mac/Linux, LM Studio on Windows. Both expose OpenAI-compatible APIs out of the box.
- Models: Qwen 2.5 Coder 14B Q4 for autocomplete (fast, small context), Qwen 2.5 Coder 32B Q4 for chat and edits (more capable, larger context — only on 24 GB+ GPU), or DeepSeek Coder V2 Lite for 16 GB cards.
- In-editor: Continue.dev with its model-roles config. Different models for autocomplete vs chat vs edit, all pointing at the same Ollama instance.
- For multi-file changes: Cline or Aider, depending on whether you prefer GUI or CLI.
- Hardware floor: 12-16 GB GPU or Apple Silicon with 24+ GB unified memory. Below that, the experience still works, but the 14B class doesn't fit, and the 7B fallback is noticeably weaker on code-specific work.
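A quick way to sanity-check the assembled stack is to ask the backend what it's serving, through the same OpenAI-compatible surface the agents use. The role-to-model mapping below mirrors the Continue.dev-style split described above; the model tags are assumptions, so match them to your own `ollama list` output.

```python
# Verify the agents' shared backend is up and serving the expected models.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

# Role -> model mapping mirroring the Continue.dev-style split above.
# Tags are assumptions; adjust to what your runtime actually has pulled.
ROLES = {
    "autocomplete": "qwen2.5-coder:14b",
    "chat":         "qwen2.5-coder:32b",
    "edit":         "qwen2.5-coder:32b",
}

served = {m.id for m in client.models.list().data}
for role, model in ROLES.items():
    status = "ok" if model in served else "MISSING: pull it first"
    print(f"{role:12s} {model:24s} {status}")
```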
The full operator-grade workflow with hardware tiers and failure modes is in /workflows/local-coding-agent-system; the learning path is at /paths/local-coding-agent.
When you still need cloud frontier
Three concrete cases where reaching for a frontier cloud model is the right call. First, opening an unfamiliar 500K-line codebase and asking “how does authentication flow through this?” — the long-context cloud frontier still has the edge. Second, novel algorithm design where you need a model to reason across multiple unfamiliar mathematical concepts. Third, very-recent framework adoption where the local model's training cutoff is older than the API you're using.
The hybrid pattern most working developers settle into: local for the 80% (autocomplete, single-file edits, refactors, testing, docs), cloud frontier for the 20% (architecture-spanning questions, frontier reasoning). With a single $20-30/month cloud subscription plus a local stack you already paid for, you cover both ends without paying per-seat at every layer.
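One way to wire that hybrid pattern into your own scripts is a crude router that picks a backend by estimated context size. Everything in this sketch is an illustrative assumption: the threshold, the model names, and the choice of cloud provider.

```python
# Sketch of the hybrid routing idea: one client per tier, picked by a crude
# context-size heuristic. Thresholds and model names are assumptions.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def pick_backend(prompt: str):
    # Rough token estimate: ~4 characters per token for code-heavy text.
    est_tokens = len(prompt) // 4
    if est_tokens > 16_000:            # beyond what the local model handles well
        return cloud, "gpt-4o"         # assumption: your cloud subscription's model
    return local, "qwen2.5-coder:32b"  # assumption: your local chat/edit model

prompt = "Explain how authentication flows through this codebase: ..."
client, model = pick_backend(prompt)
resp = client.chat.completions.create(
    model=model, messages=[{"role": "user", "content": prompt}]
)
print(resp.choices[0].message.content)
```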
Next recommended step
Assemble the full stack: hardware tiers, model picks, and failure modes, step by step, at /workflows/local-coding-agent-system.
Hardware that fits this stack: RTX 4090 24 GB for fast inference, used RTX 3090 24 GB for the budget tier, MacBook Pro M4 Max for laptop-first developers.