Code Generation

Generating code from natural language prompts. Qwen 2.5 Coder, DeepSeek Coder V3, and Codestral are the open-weight leaders.

Capability notes

Open-weight code models in 2026 target two patterns: fill-in-the-middle (FIM) for IDE autocomplete and instruction-following for chat-based generation. [DeepSeek Coder V3](/models/deepseek-coder-v3) leads with HumanEval pass@1 of 92.5% and MBPP pass@1 of 88.3% — within 5 points of Claude 3.7 Sonnet. [Codestral Mamba 7B](/models/codestral-mamba-7b), built on the Mamba-2 architecture, achieves FIM latency of 15–25ms on consumer GPUs — fast enough for real-time IDE autocomplete, where anything above 40ms means the user has already typed past the suggestion. [CodeGemma 7B](/models/codegemma-7b) delivers functional completion at 7B scale with HumanEval pass@1 of 56.1%.

FIM models split the context at the cursor into prefix and suffix and generate the middle — this is how IDE autocomplete works ([Continue.dev](/tools/continue), [Cline](/tools/cline)). Instruction-following models receive the full file or a diff as context and generate replacements — this is how chat-based coding works ([Aider](/tools/aider), [Cursor](/tools/cursor)).

Language coverage varies. [DeepSeek Coder V3](/models/deepseek-coder-v3) and [Qwen 3 32B](/models/qwen-3-32b) cover Python, JavaScript, TypeScript, Java, C++, Go, Rust, and SQL at similar quality. Specialized languages (Kotlin, Swift, R, MATLAB) see 15–30% lower pass@1. Proprietary codebases with internal frameworks produce substantially lower accuracy — the model has never seen your company's internal library APIs. The operational insight: code-generation quality correlates with training-data coverage more than parameter count. A 16B model trained extensively on a language outperforms a 70B general model on that language by 10–20% pass@1. For multi-language teams, a general 32B+ model is more practical than per-language specialized models.
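
It helps to see what a FIM prompt actually looks like before configuring an IDE plugin. Below is a minimal sketch of building one from an editor buffer; the sentinel tokens are model-specific, and the names used here follow the Qwen 2.5 Coder convention as an assumption, so check your model card before reusing them.

```python
# Minimal sketch: build a fill-in-the-middle (FIM) prompt from an editor buffer.
# Sentinel tokens are model-specific; these follow the Qwen 2.5 Coder convention
# and are an assumption. Substitute your model's documented FIM tokens.

FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def build_fim_prompt(buffer: str, cursor: int) -> str:
    """Split the file at the cursor and ask the model to generate the middle."""
    prefix = buffer[:cursor]   # everything before the cursor
    suffix = buffer[cursor:]   # the rest of the file after the cursor
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

source = "def merge(a, b):\n    \n    return result\n"
cursor = source.index("\n    return")  # cursor sits on the empty body line
prompt = build_fim_prompt(source, cursor)
print(prompt)  # send this to a completion endpoint; the reply is the "middle"
```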

If you just want to try this

Lowest-friction path to a working setup.

Install [Continue.dev](/tools/continue) as a VS Code / JetBrains extension, then add local models by editing `~/.continue/config.json`.

For autocomplete, use [Codestral Mamba 7B](/models/codestral-mamba-7b) for 15–25ms FIM latency:

```json
{
  "tabAutocompleteModel": {
    "title": "Codestral Mamba 7B",
    "provider": "ollama",
    "model": "codestral-mamba:7b"
  }
}
```

For chat, use [Qwen 3 32B](/models/qwen-3-32b) or [DeepSeek Coder V3](/models/deepseek-coder-v3):

```json
{
  "models": [{
    "title": "Qwen 3 32B",
    "provider": "ollama",
    "model": "qwen3:32b"
  }]
}
```

Pull the models first: `ollama pull codestral-mamba:7b` and `ollama pull qwen3:32b`.

Hardware: 6 GB VRAM minimum for 7B autocomplete, 16 GB+ for simultaneous 32B chat. On an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb), both models coexist with ~2 GB headroom. On 8 GB GPUs, run only the 7B autocomplete locally and point chat at the OpenRouter API for 70B-class models.

Simpler single-path option: install [LM Studio](/tools/lm-studio), download the [DeepSeek Coder V3](/models/deepseek-coder-v3) GGUF at Q4_K_M (~16 GB), start the local server, and point Continue.dev's chat model at `http://localhost:1234/v1`.

Start with 7B autocomplete + 32B chat before reaching for 70B. 70B models produce 10–15% higher pass@1 but at 2–3× the latency — a 500ms suggestion is worse than a 150ms slightly-less-accurate one.
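
Before wiring Continue.dev to the server, confirm the chat endpoint answers at all. A minimal check, assuming LM Studio's default OpenAI-compatible server at `http://localhost:1234/v1`; swap the URL and model name for your own setup.

```python
# Minimal sketch: confirm a local OpenAI-compatible server answers before
# pointing the IDE at it. Assumes LM Studio (or llama.cpp / vLLM) is serving
# at localhost:1234; adjust URL and model name to your setup.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "deepseek-coder-v3",  # whatever name your server reports for the loaded model
        "messages": [
            {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "max_tokens": 128,
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```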

For production deployment

Operator-grade recommendation.

Production code generation splits into two latency budgets: IDE autocomplete (<80ms total) and chat-based generation (<2s TTFB, <10s full response).

**IDE autocomplete (~80ms budget):** keystroke → VS Code → Continue.dev → HTTP to local server → FIM inference → streaming → diff rendering. LAN round-trip to a local GPU: 2–5ms. Cloud API: 50–150ms plus TLS. Local [RTX 4090](/hardware/rtx-4090) + [llama.cpp](/tools/llama-cpp) server + [Codestral Mamba 7B](/models/codestral-mamba-7b): 15–25ms inference + 3ms HTTP = ~25ms total. The same model via OpenRouter: 80–200ms — outside budget.

**Chat-based generation (<2s TTFB):** [Qwen 3 32B](/models/qwen-3-32b) on an [RTX 4090](/hardware/rtx-4090) via [vLLM](/tools/vllm): 300–500ms TTFB, 50–70 tok/s. [DeepSeek Coder V3](/models/deepseek-coder-v3) on an [RTX 5090](/hardware/rtx-5090): 400–600ms TTFB, 35–50 tok/s. [DeepSeek V4](/models/deepseek-v4) for reasoning-heavy tasks requires an [H100](/hardware/nvidia-h100-pcie) or [MI300X](/hardware/amd-mi300x): 1.5–3s TTFB via FP8.

**When API vs local.** API wins for teams under 10 devs generating <100 completions/day — a $20–30/mo Copilot subscription is cheaper than a GPU. Local wins when: (1) 10+ developers share inference hardware, (2) a proprietary codebase cannot be sent to cloud APIs, (3) guaranteed <80ms autocomplete latency is required (no API SLA goes under 200ms), or (4) models are fine-tuned on an internal codebase, which most APIs cannot host.

**Infrastructure.** One [RTX 6000 Ada](/hardware/rtx-6000-ada) or [L40S](/hardware/nvidia-l40s) per 15–25 developers, running two vLLM instances: 8 GB pinned for 7B autocomplete, the remainder for 32B–70B chat. Use [Ollama](/tools/ollama) only for single-dev setups — vLLM continuous batching is necessary at 5+ concurrent developers. Monitor autocomplete p95 latency weekly: crossing 100ms p95 loses developer adoption.

**Model by task.** Autocomplete: [Codestral Mamba 7B](/models/codestral-mamba-7b) for latency, [CodeGemma 7B](/models/codegemma-7b) for broad language coverage. Chat/edit: [DeepSeek Coder V3](/models/deepseek-coder-v3) for accuracy, [Qwen 3 32B](/models/qwen-3-32b) for accuracy/latency ratio. Frontier reasoning for multi-file refactors: [DeepSeek V4](/models/deepseek-v4) or the Claude API.
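
Since the autocomplete budget is the number most likely to drift, script the p95 check rather than eyeballing it. A minimal probe is sketched below; it assumes a llama.cpp-style server exposing an `/infill` endpoint on localhost:8080, so adjust the URL and payload fields to whatever your autocomplete backend actually serves.

```python
# Minimal sketch: measure autocomplete p95 latency against a local FIM server.
# Assumes a llama.cpp-style /infill endpoint on localhost:8080; adjust URL and
# payload fields to your backend before trusting the numbers.
import time
import statistics
import requests

URL = "http://localhost:8080/infill"
PAYLOAD = {
    "input_prefix": "def fib(n):\n    ",
    "input_suffix": "\n    return a\n",
    "n_predict": 32,
    "temperature": 0.1,
}

latencies = []
for _ in range(50):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=10).raise_for_status()
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

latencies.sort()
p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"p50 {statistics.median(latencies):.0f} ms | p95 {p95:.0f} ms")
# Crossing ~100 ms p95 is the adoption cliff noted above.
```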

What breaks

Failure modes operators see in the wild.

**Hallucinated API calls.** Symptom: generated code calls functions that do not exist. [DeepSeek Coder V3](/models/deepseek-coder-v3) hallucinates APIs on 8–12% of completions for Python packages outside the top 500 PyPI projects. Cause: the training data lacks the specific library version, or the API changed after the training cutoff. Mitigation: RAG over the project's dependency file to inject actual API signatures, lint generated code for undefined references, pin to the newest model.

**Security vulnerabilities in generated code.** Symptom: SQL concatenation instead of parameterized queries, hardcoded credentials, missing sanitization. A 2025 Snyk study found 28% of LLM-generated code contained OWASP Top-10 vulnerabilities. Cause: training data includes vulnerable patterns from public repos. Mitigation: pipe all generated code through static analysis (Semgrep, Bandit, CodeQL) before it enters a PR, enforce SAST in CI on all branches, require human review for AI-authored commits.

**Language-specific blind spots.** Symptom: idiomatic Python but unidiomatic Rust, Go, or TypeScript. Cause: training-data imbalance — Python and JavaScript comprise ~60% of code training data; Rust ~5%, Go ~8%. Mitigation: use models that report language-specific HumanEval scores, enforce project-specific style guides via linter.

**Context-window truncation cutting off imports.** Symptom: generated code lacks import statements because the FIM prefix didn't include them. Cause: the IDE's context-window split at the cursor drops the import section. Mitigation: configure the IDE plugin to always include the first 200 lines (the import block), or use tree-sitter to include all import statements programmatically in the prompt.

**Temperature-too-high drift on long files.** Symptom: output diverges from the intended style after 50+ lines — inconsistent naming, formatting, logic. Cause: autoregressive error accumulation — each token conditioned on slightly-off previous tokens compounds. Mitigation: use a lower temperature (0.1–0.3) for targets above 30 lines, generate in chunks with file re-grounding between them, apply a formatter (black, prettier, gofmt) to the output.

**Fill-in-the-middle position sensitivity.** Symptom: FIM quality varies by cursor position — function-boundary completions are 15–25% more accurate than mid-expression ones. Cause: FIM training constructs examples with artificial midpoint splits. Mitigation: configure the IDE plugin to align FIM boundaries with AST nodes (function boundaries), and fall back to line completion mid-expression.
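
The "lint generated code for undefined references" mitigation is cheap to automate for Python. Below is a rough sketch using the standard `ast` module; it ignores attribute access and scoping subtleties, so treat hits as review prompts rather than verdicts.

```python
# Rough sketch: flag names in generated Python that are never imported, defined,
# or built in. A cheap first filter for hallucinated APIs; attribute access
# (e.g. pkg.made_up_func) and scoping are deliberately ignored.
import ast
import builtins

def undefined_names(code: str) -> set[str]:
    tree = ast.parse(code)
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return used - defined

generated = "data = load_json_fast('cfg.json')\nprint(data)\n"
print(undefined_names(generated))  # {'load_json_fast'}: a name the model invented
```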

Hardware guidance

**Hobbyist tier ($300–600 GPU).** [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs 7B autocomplete at 20–30ms per suggestion — functional for single-dev IDE integration. [Intel Arc B580](/hardware/intel-arc-b580) at 12 GB via SYCL: 25–40ms per suggestion. Neither supports simultaneous autocomplete + chat — 12 GB is consumed by one 7B model. [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 36–48 GB unified memory runs both simultaneously via Metal.

**SMB tier ($1,500–2,500 GPU).** [RTX 4090](/hardware/rtx-4090) at 24 GB is the sweet spot — 7B autocomplete (6 GB) + 32B chat (18 GB) simultaneously at full GPU speed. [RTX 5080](/hardware/rtx-5080) at 16 GB: 7B + 32B Q4 with partial offload, chat latency 30–50% higher than the 4090. [RTX 5090](/hardware/rtx-5090) at 32 GB: 7B + 32B FP16 with 5–8 GB headroom — the best single-card code workstation.

**Team serving tier ($6,000–20,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) (48 GB) or [L40S](/hardware/nvidia-l40s) (48 GB) per 15–25 developers. Split: 8 GB for 7B autocomplete, 40 GB for 32B–70B chat. At 48 GB, a 70B Q4 fits with 8 GB of KV-cache headroom. For 25+ developers, add a second identical card with vLLM load balancing. For 50+ developers requiring 70B chat: [A100 80GB SXM](/hardware/nvidia-a100-80gb-sxm) — 70B FP8 with 40+ GB KV cache at 130–150 tok/s.

**Enterprise tier ($25,000+).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB with 2.0 TB/s handles 50–100 concurrent dev sessions on 70B chat with continuous batching; TTFB stays under 800ms at 50 concurrent. For reasoning-heavy code generation, [DeepSeek V4](/models/deepseek-v4) on an [H200](/hardware/nvidia-h200) (141 GB) or [MI300X](/hardware/amd-mi300x) (192 GB).

Memory bandwidth matters more than TFLOPS for code workloads. The autocomplete model reads the full prefix on every keystroke — bandwidth-bound. [RTX 3090](/hardware/rtx-3090) at 936 GB/s often outperforms [RTX 4070](/hardware/rtx-4070) on FIM latency despite lower compute, because the 3090's 384-bit bus delivers 1.9× the bandwidth. Rank by bandwidth first, VRAM second, compute third.
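
All of the tier claims above reduce to the same back-of-envelope arithmetic: weight bytes at the chosen quantization, plus KV cache, plus runtime overhead. Below is a rough estimator; the layer/head numbers in the example call are illustrative assumptions, so substitute the values from your model's config.

```python
# Rough sketch: will this model + context fit in VRAM? The architecture numbers
# in the example call are illustrative assumptions; pull layers/kv_heads/head_dim
# from your model's config.json, and treat the ~1.5 GB runtime overhead as a guess.

def fits_in_vram(params_b: float, bits_per_weight: float, ctx_len: int,
                 n_layers: int, n_kv_heads: int, head_dim: int,
                 vram_gb: float, kv_bytes: int = 2) -> None:
    weights_gb = params_b * bits_per_weight / 8                      # e.g. ~4.5 bpw for Q4_K_M
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # K and V per token
    kv_gb = kv_per_token * ctx_len / 1e9
    total = weights_gb + kv_gb + 1.5                                 # + runtime/activation overhead
    verdict = "fits" if total <= vram_gb else "does NOT fit"
    print(f"weights {weights_gb:.1f} GB + KV {kv_gb:.1f} GB + ~1.5 GB overhead "
          f"= {total:.1f} GB -> {verdict} in {vram_gb} GB")

# Hypothetical 32B coder at Q4_K_M with 32K context on a 24 GB card
# (64 layers, 8 KV heads, head_dim 128 assumed for illustration):
fits_in_vram(32, 4.5, 32_768, n_layers=64, n_kv_heads=8, head_dim=128, vram_gb=24)
```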

Runtime guidance

**Continue.dev vs Cursor vs Aider — three distinct architectures.** [Continue.dev](/tools/continue) is an open-source IDE plugin (VS Code + JetBrains) that connects to any local or remote model via Ollama, llama.cpp, or an OpenAI-compatible API. It provides tab autocomplete (FIM, low-latency) and a chat sidebar (instruction-following). Setup: install the extension, point `~/.continue/config.json` at local Ollama. Tradeoff: features lag Cursor by 6–12 months — no agentic mode, no multi-file edit preview, no test-run integration.

[Cursor](/tools/cursor) is a proprietary VS Code fork with integrated cloud models and a "Composer" agentic mode for multi-file edits. Cursor's agentic features require Anthropic or OpenAI models — local models don't power them. Best-in-class AI coding UX at $20/mo. Tradeoff: your code goes to Anthropic/OpenAI servers.

[Aider](/tools/aider) is a terminal-based pair-programming tool that connects to any LLM (local or cloud) and generates a git commit for each change. Architecture: map-reduce with a repo map + relevant files → the model generates search/replace blocks. It works with any model that outputs valid edit formats. Best for git-tracked, reviewable AI edits with maximum model portability. Tradeoff: terminal-only, no inline autocomplete.

**Backend comparison.** [Ollama](/tools/ollama) for Continue.dev/Aider: simplest setup, one binary, automatic GPU detection. Tradeoff: single concurrent request, no FIM API (only /chat and /generate, not /fim). [llama.cpp](/tools/llama-cpp) server: exposes /completion with FIM via `--cont-batching`, achieves lower latency than Ollama by skipping the chat-template layer, and supports grammar-constrained output. Use llama.cpp directly when FIM latency matters.

**Recommendation.** Single developer: [Continue.dev](/tools/continue) + [Ollama](/tools/ollama) for chat + a [llama.cpp](/tools/llama-cpp) server for autocomplete. Team of 5–25: [Continue.dev](/tools/continue) → shared [vLLM](/tools/vllm) instances (7B autocomplete + 32B–70B chat). Maximum capability regardless of local constraints: [Cursor](/tools/cursor) + Claude 3.7 Sonnet for agentic work + local [Ollama](/tools/ollama) for free autocomplete via [Codestral Mamba 7B](/models/codestral-mamba-7b). Terminal-native: [Aider](/tools/aider) + [DeepSeek Coder V3](/models/deepseek-coder-v3) via local [Ollama](/tools/ollama).
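
The Aider architecture is easiest to picture from its edit format: the model emits search/replace blocks that a thin driver applies to files and then commits. The sketch below is illustrative only; the marker strings follow Aider's documented SEARCH/REPLACE convention, but the parsing and apply logic is not Aider's actual implementation.

```python
# Illustrative sketch of the "model emits search/replace blocks" pattern Aider uses.
# Marker strings follow Aider's SEARCH/REPLACE convention; the parsing here is
# simplified and is NOT Aider's implementation.
import re
from pathlib import Path

MODEL_OUTPUT = """\
main.py
<<<<<<< SEARCH
def greet(name):
    print("hi " + name)
=======
def greet(name: str) -> None:
    print(f"hi {name}")
>>>>>>> REPLACE
"""

BLOCK = re.compile(
    r"^(?P<file>\S+)\n<<<<<<< SEARCH\n(?P<old>.*?)\n=======\n(?P<new>.*?)\n>>>>>>> REPLACE",
    re.S | re.M,
)

def apply_edits(model_output: str) -> None:
    for m in BLOCK.finditer(model_output):
        path = Path(m["file"])
        text = path.read_text()
        if m["old"] not in text:
            raise ValueError(f"SEARCH text not found in {path}")  # model drifted from file contents
        path.write_text(text.replace(m["old"], m["new"], 1))
        # a real driver would now git-commit the change for review

# apply_edits(MODEL_OUTPUT)  # uncomment with a real main.py on disk
```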

Setup walkthrough

  1. Install Ollama from ollama.com.
  2. `ollama pull qwen2.5-coder:7b` (~4.7 GB download).
  3. `ollama run qwen2.5-coder:7b` and type: "Write a Python function that merges two sorted arrays in O(n+m) time."
  4. First response in 2–5 seconds on an 8 GB GPU.
  5. For VS Code integration: install the Continue extension → configure it to use Ollama with `qwen2.5-coder:7b`.
  6. For a stronger coding model on a 24 GB GPU: `ollama pull deepseek-coder-v3` (~40 GB download, requires 24+ GB VRAM).

For Aider integration: `pip install aider-chat` → `aider --model ollama/qwen2.5-coder:14b` launches a CLI pair-programming session (pull the model first with `ollama pull qwen2.5-coder:14b`).
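
If you would rather script step 3 than type it interactively, the same model answers over Ollama's local REST API (default port 11434):

```python
# Minimal sketch: hit Ollama's local REST API instead of the interactive CLI.
# Assumes `ollama pull qwen2.5-coder:7b` has already completed (step 2 above).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:7b",
        "prompt": "Write a Python function that merges two sorted arrays in O(n+m) time.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```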

The cheap setup

Used [RTX 3060 12GB](/hardware/rtx-3060-12gb) ($200–250). Runs Qwen 2.5 Coder 7B at 60–80 tok/s — fast enough for IDE autocomplete. Can run Qwen 2.5 Coder 14B Q4_K_M at 25–35 tok/s with full offload. Pair with a Ryzen 5 5600 + 16 GB DDR4 + 512 GB NVMe. Total: ~$360–405. If you need more context for full-repo editing, upgrade to 32 GB system RAM ($50).

The serious setup

Used [RTX 3090](/hardware/rtx-3090) 24 GB (~$700–900). Runs DeepSeek Coder V3 at 15–20 tok/s via llama.cpp, and Qwen 2.5 Coder 32B Q6_K at 35–50 tok/s. Can run Aider with full repo context on mid-size codebases. Pair with a Ryzen 7 7700X + 64 GB DDR5 + 2 TB NVMe. Total: ~$1,800–2,200. For the best local coding-agent experience, add a second RTX 3090 — 48 GB total runs DeepSeek V3 at 25–35 tok/s.

Common beginner mistake

**The mistake:** using a general-purpose chat model (like Llama 3.1 8B) for code generation and wondering why the output has syntax errors. **Why it fails:** general chat models aren't fine-tuned on code — they hallucinate APIs, mix languages, and miss edge cases. **The fix:** use a code-specific model: Qwen 2.5 Coder 7B/14B/32B, DeepSeek Coder V3, or Codestral Mamba 7B. These are trained predominantly on code corpora and dramatically reduce syntax errors and API hallucinations.

Recommended setup for code generation

Recommended hardware
  • Best GPU for Ollama (coding workflows) → (code models work great on Ollama; 16 GB minimum)
Recommended runtimes
  • Browse all tools for runtimes that fit this workload.
Budget build
  • AI PC under $1,000 →
Best GPU for this task
  • Best GPU for Ollama (coding workflows) →

Reality check

Code models are LLM workloads — same VRAM math applies. 16 GB runs 13-32B Q4 (Qwen 2.5 Coder, DeepSeek Coder); 24 GB unlocks 70B-class code models. The killer detail is context window — code review wants 32K+, which pushes KV cache beyond 16 GB on 70B.

Common mistakes

  • Skipping context-window math (KV cache eats VRAM at scale)
  • Using general-purpose instruct models for code (specialized code models are 30–50% better)
  • Running coding-agent loops on 8 GB (fine for a single 7B call, but agent loops compound context)
  • Forgetting that flash attention matters more for code workloads than for chat

What breaks first

The errors most operators hit when running code generation locally. Each links to a diagnose+fix walkthrough.

  • CUDA out of memory →
  • Model keeps crashing →
  • Ollama running slow →
  • Tokenizer mismatch →

Before you buy

Verify your specific hardware can handle code generation before committing money.

  • Will it run on my hardware? →
  • Custom compatibility check →
  • GPU recommender (4 questions) →

Hardware buying guidance for Code Generation

Local coding workflows live or die on time-to-first-token and 32K+ context. The guides below cover the developer-specific hardware decision.

  • Best GPU for Qwen →
  • AI PC build for developers →

Featured models

  • DeepSeek Coder V3
  • Codestral Mamba 7B