Capability notes
Open-weight code models in 2026 target two patterns: fill-in-the-middle (FIM) for IDE autocomplete and instruction-following for chat-based generation. [DeepSeek Coder V3](/models/deepseek-coder-v3) leads with HumanEval pass@1 of 92.5% and MBPP pass@1 of 88.3%, within 5 points of Claude 3.7 Sonnet. [Codestral Mamba 7B](/models/codestral-mamba-7b), a Mamba-2 architecture, achieves FIM latency of 15–25ms on consumer GPUs, fast enough for real-time IDE autocomplete; above roughly 40ms the user has already typed past the suggestion. [CodeGemma 7B](/models/codegemma-7b) delivers functional completion at 7B scale with HumanEval pass@1 of 56.1%.
FIM models split the context at the cursor into a prefix and a suffix and generate the middle; this is how IDE autocomplete works ([Continue.dev](/tools/continue), [Cline](/tools/cline)). Instruction-following models receive the full file or a diff as context and generate replacement code; this is how chat-based coding works ([Aider](/tools/aider), [Cursor](/tools/cursor)).
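To make the FIM pattern concrete, here is roughly what an autocomplete request looks like against a local llama.cpp server's `/infill` endpoint. This is a sketch: it assumes a server is already running on port 8080 with a FIM-capable model loaded, and the exact field names can differ between server versions.

```bash
# Everything before the cursor goes in input_prefix, everything after in input_suffix;
# the model generates only the missing middle.
curl -s http://localhost:8080/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "def mean(values: list[float]) -> float:\n    ",
    "input_suffix": "\n    return total / len(values)\n",
    "n_predict": 48,
    "temperature": 0.1
  }'
```

An instruction-following request, by contrast, sends the whole file plus a natural-language instruction to a chat-style endpoint and gets back a rewritten block rather than just the missing middle.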
Language coverage varies. [DeepSeek Coder V3](/models/deepseek-coder-v3) and [Qwen 3 32B](/models/qwen-3-32b) cover Python, JavaScript, TypeScript, Java, C++, Go, Rust, and SQL at similar quality. Less-represented languages (Kotlin, Swift, R, MATLAB) see 15–30% lower pass@1. Proprietary codebases with internal frameworks produce substantially lower accuracy: the model has never seen your company's internal library APIs.
The operational insight: code generation quality correlates with training data coverage more than parameter count. A 16B model trained extensively on a language outperforms a 70B general model on that language by 10–20% pass@1. For multi-language teams, a general 32B+ model is more practical than per-language specialized models.
If you just want to try this
Lowest-friction path to a working setup.
Install [Continue.dev](/tools/continue) as a VS Code / JetBrains extension, then add local models. Edit `~/.continue/config.json`:
For autocomplete, use [Codestral Mamba 7B](/models/codestral-mamba-7b) for 15–25ms FIM latency:
```json
{
  "tabAutocompleteModel": {
    "title": "Codestral Mamba 7B",
    "provider": "ollama",
    "model": "codestral-mamba:7b"
  }
}
```
For chat, use [Qwen 3 32B](/models/qwen-3-32b) or [DeepSeek Coder V3](/models/deepseek-coder-v3):
```json
{
  "models": [{
    "title": "Qwen 3 32B",
    "provider": "ollama",
    "model": "qwen3:32b"
  }]
}
```
Pull the models first: `ollama pull codestral-mamba:7b` and `ollama pull qwen3:32b`.
Hardware: 6 GB VRAM minimum for 7B autocomplete, ~24 GB to also run 32B chat fully on-GPU. On an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb), the 7B autocomplete model runs with headroom to spare, but a 32B chat model needs partial CPU offload. On 8 GB GPUs, run only 7B autocomplete locally and point chat to an OpenRouter API for 70B-class models.
A simpler single-model path: install [LM Studio](/tools/lm-studio), download the [DeepSeek Coder V3](/models/deepseek-coder-v3) GGUF at Q4_K_M (~16 GB), start the local server, and point Continue.dev's chat model at `http://localhost:1234/v1`.
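Before pointing Continue.dev at it, smoke-test the endpoint directly. A minimal check against the local OpenAI-compatible server (the model name in the second request is a placeholder; use whatever `/v1/models` reports):

```bash
# List the models the local server exposes
curl -s http://localhost:1234/v1/models

# One-shot chat completion against the same endpoint
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-coder-v3",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
    "max_tokens": 128
  }'
```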
Start with 7B autocomplete + 32B chat before reaching for 70B. 70B models produce 10–15% higher pass@1 but at 2–3× the latency — a 500ms suggestion is worse than a 150ms slightly-less-accurate one.
For production deployment
Operator-grade recommendation.
Production code generation splits into two latency budgets: IDE autocomplete (<80ms total) and chat-based generation (<2s TTFB, <10s full response).
**IDE autocomplete (~80ms budget):** keystroke → VS Code → Continue.dev → HTTP to local server → FIM inference → streaming → diff rendering. LAN round-trip to local GPU: 2–5ms. Cloud API: 50–150ms plus TLS. Local [RTX 4090](/hardware/rtx-4090) + [llama.cpp](/tools/llama-cpp) server + [Codestral Mamba 7B](/models/codestral-mamba-7b): 15–25ms inference + 3ms HTTP = ~25ms total. Same model via OpenRouter: 80–200ms — outside budget.
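A sketch of the local autocomplete server behind those numbers, assuming a GGUF build of the model is on disk (the file path is illustrative, and flag names follow current llama.cpp conventions, which shift between releases):

```bash
# Serve a 7B FIM model fully on-GPU for tab autocomplete.
# -ngl 99 offloads all layers; a short context keeps per-keystroke prefix reads fast.
llama-server -m ./codestral-mamba-7b.Q4_K_M.gguf \
  --port 8080 -ngl 99 --ctx-size 4096
```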
**Chat-based generation (<2s TTFB):** [Qwen 3 32B](/models/qwen-3-32b) on [RTX 4090](/hardware/rtx-4090) via [vLLM](/tools/vllm): 300–500ms TTFB, 50–70 tok/s. [DeepSeek Coder V3](/models/deepseek-coder-v3) on [RTX 5090](/hardware/rtx-5090): 400–600ms TTFB, 35–50 tok/s. [DeepSeek V4](/models/deepseek-v4) for reasoning-heavy tasks requires [H100](/hardware/nvidia-h100-pcie) or [MI300X](/hardware/amd-mi300x): 1.5–3s TTFB via FP8.
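A minimal vLLM launch for the chat side, assuming the chat model has the card to itself and an AWQ build is available (the Hugging Face repo name is an assumption; if the autocomplete model shares the GPU, lower `--gpu-memory-utilization` accordingly):

```bash
# OpenAI-compatible chat endpoint on port 8000
vllm serve Qwen/Qwen3-32B-AWQ \
  --port 8000 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90
```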
**When to use API vs local.** API wins for teams under 10 devs generating <100 completions/day; a $20–30/mo Copilot subscription is cheaper than a GPU. Local wins when: (1) 10+ developers share inference hardware, (2) a proprietary codebase cannot be sent to cloud APIs, (3) guaranteed <80ms autocomplete latency is required (no API SLA goes under 200ms), or (4) models are fine-tuned on the internal codebase, which most API providers cannot host.
**Infrastructure.** Single [RTX 6000 Ada](/hardware/rtx-6000-ada) or [L40S](/hardware/nvidia-l40s) per 15–25 developers, running two vLLM instances: 8 GB pinned for 7B autocomplete, remainder for 32B–70B chat. Use [Ollama](/tools/ollama) only for single-dev — vLLM continuous batching is necessary at 5+ concurrent developers. Monitor autocomplete p95 latency weekly: crossing 100ms p95 loses developer adoption.
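A sketch of that split on one 48 GB card, launched as two independent vLLM servers. Model repos, quantization choices, and memory fractions here are assumptions to illustrate the mechanism; `--gpu-memory-utilization` is a fraction of total VRAM, so 0.17 pins roughly 8 GB.

```bash
# Instance 1: 7B autocomplete pinned to ~8 GB (8/48 ≈ 0.17);
# on-the-fly FP8 quantization keeps the 7B weights inside that slice.
vllm serve mistralai/Mamba-Codestral-7B-v0.1 \
  --port 8001 --gpu-memory-utilization 0.17 \
  --quantization fp8 --max-model-len 4096 &

# Instance 2: 32B chat on the remaining VRAM (0.17 + 0.75 < 1.0 leaves margin)
vllm serve Qwen/Qwen3-32B-AWQ \
  --port 8002 --gpu-memory-utilization 0.75 --max-model-len 32768
```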
**Model by task.** Autocomplete: [Codestral Mamba 7B](/models/codestral-mamba-7b) for latency, [CodeGemma 7B](/models/codegemma-7b) for broad language coverage. Chat/edit: [DeepSeek Coder V3](/models/deepseek-coder-v3) for accuracy, [Qwen 3 32B](/models/qwen-3-32b) for accuracy/latency ratio. Frontier reasoning for multi-file refactors: [DeepSeek V4](/models/deepseek-v4) or Claude API.
What breaks
Failure modes operators see in the wild.
**Hallucinated API calls.** Symptom: generated code calls functions that do not exist. [DeepSeek Coder V3](/models/deepseek-coder-v3) hallucinates APIs on 8–12% of completions for Python packages outside the top 500 PyPI projects. Cause: the training data lacks the specific library version, or the API changed after the training cutoff. Mitigation: RAG over the project's dependencies to inject actual API signatures into the prompt, lint generated code for undefined references, and prefer models with the most recent training cutoff.
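One cheap guardrail, assuming Python output and placeholder file names: run a name-resolution lint over the generated snippet before accepting it. It will not catch a wrong method on a real object, but it catches invented top-level functions and missing imports.

```bash
# Flag references to names that do not exist (rule F821 = undefined name)
ruff check --select F821 generated_snippet.py

# Equivalent check with pyflakes, if ruff is not in the toolchain
python -m pyflakes generated_snippet.py
```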
**Security vulnerabilities in generated code.** Symptom: SQL built by string concatenation instead of parameterized queries, hardcoded credentials, missing input sanitization. A 2025 Snyk study found 28% of LLM-generated code contained OWASP Top-10 vulnerabilities. Cause: training data includes vulnerable patterns from public repos. Mitigation: pipe all generated code through static analysis (Semgrep, Bandit, CodeQL) before it enters a pull request, enforce SAST in CI on all branches, and require human review for AI-authored commits.
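A minimal pre-merge scan along those lines; the Semgrep ruleset is a registry pack and the paths are placeholders:

```bash
# Block known-vulnerable patterns before AI-authored code reaches review
# (--error makes findings fail the CI step)
semgrep scan --config p/owasp-top-ten --error src/

# Python-specific checks: hardcoded secrets, SQL built by concatenation, etc.
bandit -r src/ -ll
```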
**Language-specific blind spots.** Symptom: idiomatic Python but unidiomatic Rust, Go, or TypeScript. Cause: training data imbalance — Python and JavaScript comprise ~60% of code training data; Rust ~5%, Go ~8%. Mitigation: use models that report language-specific HumanEval scores, enforce project-specific style guides via linter.
**Context window truncation cutting off imports.** Symptom: generated code lacks import statements because the FIM prefix didn't include them. Cause: IDE context window split at cursor position drops the import section. Mitigation: configure IDE plugin to always include first 200 lines (import block), use tree-sitter to include all import statements programmatically in the prompt.
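A crude stand-in for the tree-sitter approach, assuming a Python file and placeholder paths: pull the import block out of the file head and prepend it to whatever prefix window the plugin would otherwise send.

```bash
# Collect the import statements from the top of the file...
grep -E '^(import|from) ' src/module.py | head -n 50 > /tmp/import_block.py

# ...and prepend them to the cursor-local prefix, so completions can reference
# existing imports instead of re-inventing or omitting them.
cat /tmp/import_block.py /tmp/cursor_prefix.py > /tmp/fim_prefix.py
```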
**Style drift on long files.** Symptom: output diverges from the intended style after 50+ lines: inconsistent naming, formatting, logic. Cause: autoregressive error accumulation; each token is conditioned on slightly-off previous tokens, and the error compounds. Mitigation: use a lower temperature (0.1–0.3) for targets above 30 lines, generate in chunks and re-ground on the file between chunks, apply a formatter (black, prettier, gofmt) to the output.
**Fill-in-middle position sensitivity.** Symptom: FIM quality varies by cursor position — function-boundary completions 15–25% more accurate than mid-expression. Cause: FIM training constructs examples with artificial midpoint splits. Mitigation: configure IDE plugin to align FIM boundaries with AST nodes (function boundaries), fall back to line completion for mid-expression.
Hardware guidance
**Hobbyist tier ($300–600 GPU).** [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs 7B autocomplete at 20–30ms per suggestion, functional for single-dev IDE integration. [Intel Arc B580](/hardware/intel-arc-b580) at 12 GB via SYCL: 25–40ms per suggestion. Neither supports simultaneous autocomplete + chat: after the 7B autocomplete model and its context, 12 GB leaves no room for a 32B-class chat model. [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 36–48 GB unified memory runs both simultaneously via Metal.
**SMB tier ($1,500–2,500 GPU).** [RTX 4090](/hardware/rtx-4090) at 24 GB is the sweet spot: 7B autocomplete (6 GB) + 32B chat (18 GB) run simultaneously at full GPU speed. [RTX 5080](/hardware/rtx-5080) at 16 GB: 7B + 32B Q4 with partial offload, chat latency 30–50% higher than the 4090. [RTX 5090](/hardware/rtx-5090) at 32 GB: 7B + 32B Q4 fully on-GPU with 5–8 GB headroom, the best single-card code workstation.
**Team serving tier ($6,000–20,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) (48 GB) or [L40S](/hardware/nvidia-l40s) (48 GB) per 15–25 developers. Split: 8 GB for 7B autocomplete, 40 GB for 32B–70B chat. At 48 GB, 70B Q4 fits with 8 GB of KV cache headroom. For 25+ developers, add a second identical card with vLLM load balancing. For 50+ developers requiring 70B chat: [A100 80GB SXM](/hardware/nvidia-a100-80gb-sxm) runs 70B Q4 with 40+ GB of KV cache headroom at 130–150 tok/s aggregate.
**Enterprise tier ($25,000+).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB with 2.0 TB/s handles 50–100 concurrent dev sessions on 70B chat with continuous batching. TTFB stays under 800ms at 50 concurrent. For reasoning-heavy code generation, [DeepSeek V4](/models/deepseek-v4) on [H200](/hardware/nvidia-h200) (141 GB) or [MI300X](/hardware/amd-mi300x) (192 GB).
Memory bandwidth matters more than TFLOPS for code workloads. Decode streams the model's full weights for every generated token, so autocomplete is bandwidth-bound. [RTX 3090](/hardware/rtx-3090) at 936 GB/s often beats the newer [RTX 4070](/hardware/rtx-4070) on FIM latency because the 3090's 384-bit bus delivers 1.9× the memory bandwidth. Rank by bandwidth first, VRAM second, compute third.
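Back-of-envelope check, assuming a ~4.1 GB 7B Q4_K_M weight file: each decoded token has to pull roughly the whole file through the memory bus, so the per-token floor is weight bytes divided by bandwidth.

```bash
# per-token decode floor ≈ weight bytes / memory bandwidth
python3 -c 'print(f"RTX 3090: {4.1/936*1000:.1f} ms/token at 936 GB/s")'
python3 -c 'print(f"RTX 4070: {4.1/504*1000:.1f} ms/token at 504 GB/s")'
```

The 1.9× bandwidth gap translates almost directly into the FIM latency gap.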
Runtime guidance
**Continue.dev vs Cursor vs Aider — three distinct architectures.**
[Continue.dev](/tools/continue) is an open-source IDE plugin (VS Code + JetBrains) connecting to any local or remote model via Ollama, llama.cpp, or OpenAI-compatible API. It provides tab autocomplete (FIM, low-latency) and chat sidebar (instruction-following). Setup: install extension, point `~/.continue/config.json` at local Ollama. Tradeoff: features lag Cursor 6–12 months — no agentic mode, no multi-file edit preview, no test-run integration.
[Cursor](/tools/cursor) is a proprietary VS Code fork with integrated cloud models and a "Composer" agentic mode for multi-file edits. The agentic features require Anthropic or OpenAI models; local models cannot drive them. Best-in-class AI coding UX at $20/mo. Tradeoff: your code goes to Anthropic/OpenAI servers.
[Aider](/tools/aider) is a terminal-based pair-programming tool that connects to any LLM (local or cloud) and generates a git commit for each change. Architecture: a repo map plus the relevant files go to the model, which returns search/replace edit blocks. Works with any model that reliably emits valid edit formats. Best for git-tracked, reviewable AI edits with maximum model portability. Tradeoff: terminal-only, no inline autocomplete.
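A typical local-model invocation, assuming Ollama is serving on its default port; the `ollama/` prefix is LiteLLM's provider convention, and the model tag and file paths are placeholders (use whatever `ollama list` shows).

```bash
# Point Aider at the local Ollama server and the files to edit;
# each accepted change becomes its own git commit.
export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama/qwen3:32b src/parser.py tests/test_parser.py
```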
**Backend comparison.** [Ollama](/tools/ollama) for Continue.dev/Aider: simplest setup, one binary, automatic GPU detection. Tradeoff: a single concurrent request by default and no dedicated FIM endpoint (only /chat and /generate). [llama.cpp](/tools/llama-cpp) server: exposes a dedicated /infill endpoint for FIM alongside /completion, supports continuous batching (`--cont-batching`) and grammar-constrained output, and shaves latency by skipping the chat-template layer. Use llama.cpp directly when FIM latency matters.
**Recommendation.** Single developer: [Continue.dev](/tools/continue) + [Ollama](/tools/ollama) for chat + [llama.cpp](/tools/llama-cpp) server for autocomplete. Team of 5–25: [Continue.dev](/tools/continue) → shared [vLLM](/tools/vllm) instances (7B autocomplete + 32B–70B chat). Maximum capability regardless of local constraints: [Cursor](/tools/cursor) + Claude 3.7 Sonnet for agentic work + local [Ollama](/tools/ollama) for free autocomplete via [Codestral Mamba 7B](/models/codestral-mamba-7b). Terminal-native: [Aider](/tools/aider) + [DeepSeek Coder V3](/models/deepseek-coder-v3) via local [Ollama](/tools/ollama).