
Debugging

AI-assisted bug diagnosis and fix generation. Reasoning + code understanding + execution tool-use combine here.

Capability notes

AI-assisted debugging uses language models to identify, explain, and fix bugs from error messages, stack traces, and source code. The capability spectrum runs from **static analysis** (feed error + code in; model explains and suggests a fix) to **agentic loops** (model reads the error, searches the codebase, proposes a fix, applies it, runs tests, and iterates).

**What AI debugging fixes reliably**: (1) Common exceptions (NullPointerException, IndexError, TypeError), pattern-matched against millions of similar bugs; first-hit rate 40-60% for [DeepSeek Coder V3](/models/deepseek-coder-v3). (2) Missing imports and syntax errors: near-100%. (3) Simple logic errors (off-by-one, inverted condition): 50-70%. (4) Dependency version mismatches: 60-80%; the model recognizes "module X has no attribute Y" as a version incompatibility.

**What AI debugging cannot fix**: (1) Concurrency bugs (race conditions, deadlocks): static code can't reveal non-deterministic runtime behavior; fix rate <20%. (2) Distributed system failures: network partitions, retry storms across services. (3) Heisenbugs: timing-sensitive bugs that disappear under observation. (4) Bugs requiring domain knowledge (a wrong formula in a physics simulation, an incorrect tax calculation): the model doesn't know the correct domain value.

**SWE-bench scores** (real-world GitHub issue fixes): [DeepSeek Coder V3](/models/deepseek-coder-v3) plus an agentic loop solves 30-40% of SWE-bench-verified issues; [Claude 3.7 Sonnet](https://claude.ai) plus an agentic loop, 50-65%; GPT-5, 55-70%. The open-weight vs API gap on bug-fixing is 15-30 points, wider than on code generation.

**Stack trace analysis**: Models excel here because stack traces are structured and information-dense. Given a stack trace plus the failing code, model explanations match human reasoning 70-85% of the time for single-file bugs. The explanation is often indistinguishable from a senior engineer's assessment: correct about which line, variable, or condition is involved, though sometimes wrong about root cause.
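To make the stack-trace point concrete, here is a minimal, self-contained Python example (function names are invented for illustration) showing how much structured information a traceback carries: every frame in the chain, not just the raising line, is machine-readable.

```python
import traceback

def parse_port(raw: str) -> int:
    return int(raw)

def load_settings(args: list) -> int:
    # The raising frame (the symptom): IndexError when no port is passed
    return parse_port(args[1])

try:
    load_settings(["prog"])  # the root cause: this caller omitted the port
except Exception as exc:
    te = traceback.TracebackException.from_exception(exc)
    # Walk every frame, not only the last: earlier frames expose the
    # call chain that produced the bad value.
    for frame in te.stack:
        print(f"{frame.filename}:{frame.lineno} in {frame.name} -> {frame.line}")
    print(f"{type(exc).__name__}: {exc}")
```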

If you just want to try this

Lowest-friction path to a working setup.

Start with offline stack trace analysis using [LM Studio](/tools/lm-studio): zero integration with your codebase required. When you hit a bug, copy the full error message, stack trace, and relevant code files. Paste into LM Studio with [DeepSeek Coder V3](/models/deepseek-coder-v3) loaded and ask: "What caused this error and how do I fix it?" Include the full traceback chain, not just the last line; earlier frames often contain the root cause.

[DeepSeek Coder V3](/models/deepseek-coder-v3) at Q4 requires ~20 GB VRAM; an [RTX 3090 24GB](/hardware/rtx-3090) or [RTX 4090 24GB](/hardware/rtx-4090) handles it. With less VRAM, [CodeGemma 7B](/models/codegemma-7b) at Q4 (~5 GB) runs on an [RTX 3060 12GB](/hardware/rtx-3060-12gb), but the fix rate drops significantly: a 7B catches simple bugs and misses multi-line logic errors. [Codestral Mamba 7B](/models/codestral-mamba-7b) is another small option but produces lower-quality explanations than transformer code models.

First-hit fix rate on common patterns is high: off-by-one errors, missing null checks, wrong imports. For unfamiliar error messages in unfamiliar libraries, the model's ability to explain what the error means cuts debugging time 30-50% even when its suggested fix is wrong.

For multi-file bugs (error in file A, root cause in file B), the offline copy-paste approach breaks down: you need all relevant files, and you don't always know which are relevant. This is when you graduate to agentic tools (see the operator section below). For privacy-sensitive debugging (crash logs with PII, proprietary code), fully local [LM Studio](/tools/lm-studio) guarantees no data leaves your machine. [Ollama](/tools/ollama) is equivalent; LM Studio's GUI is slightly simpler for beginners.
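If you repeat the copy-paste ritual often, it scripts easily. A minimal sketch, assuming LM Studio's local server is enabled on its default OpenAI-compatible endpoint (http://localhost:1234/v1) and a code model is already loaded; `error.txt`, `payments.py`, and the model id are placeholders for your own traceback, source file, and loaded model.

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key is ignored locally.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

traceback_text = open("error.txt").read()   # the full traceback, not just the last line
source = open("payments.py").read()         # the file the trace points into

resp = client.chat.completions.create(
    model="deepseek-coder",                 # placeholder: use the id of the loaded model
    messages=[{
        "role": "user",
        "content": (
            "What caused this error and how do I fix it?\n\n"
            f"Traceback:\n{traceback_text}\n\nSource:\n{source}"
        ),
    }],
)
print(resp.choices[0].message.content)
```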

For production deployment

Operator-grade recommendation.

Production AI debugging deploys agentic loops that read errors, search codebases, propose and apply fixes, run tests, and iterate, reducing MTTR for common bug categories.

**Agentic architecture**: A bug report (error + repro steps) flows through three stages. (1) Plan: read the error, search the codebase ([Continue.dev](/tools/continue)-style indexing or grep), form a hypothesis, propose a fix. (2) Execute: apply the fix via tree-sitter AST modification, which reduces syntax errors 30-50% vs line-based search-replace. (3) Verify: run the relevant test suite. On pass, flag for review; on fail, return the error to the agent for the next iteration. (A minimal sketch of this loop follows below.)

**Tools**: [Cline](/tools/cline) in VS Code supports agentic debugging: it reads files, proposes edits, runs terminal commands, and iterates. Cline with the [Claude API](https://claude.ai) achieves the highest SWE-bench scores (50-65%). [Aider](/tools/aider) with `--architect` mode splits the work: one model plans (a reasoning model like [DeepSeek V4](/models/deepseek-v4)) and another edits (a code model like [DeepSeek Coder V3](/models/deepseek-coder-v3)). The two-model split improves fix quality by reducing hallucinated edits.

**When AI reduces MTTR**: (1) Common runtime exceptions: fix rate 60-80%, MTTR reduction 50-70%. (2) API compatibility after a dependency upgrade: 40-60% MTTR reduction. (3) Configuration errors: fix rate 70-90%, MTTR reduction 60-80%. (4) Test failures with clear assertions: fix rate 40-60%.

**When AI does NOT reduce MTTR**: (1) Novel bugs outside training patterns: fix rate <20%. (2) Bugs needing external knowledge (third-party API changes, hardware-specific issues): near-zero. (3) Performance bugs: the agent cannot profile running systems.

**Incident response integration**: Error monitoring (Sentry/Datadog) triggers the agentic debugger with the error, stack trace, and recent git diff; the agent produces a fix PR; the on-call engineer reviews and merges. This handles 20-40% of incidents (simple, pattern-matchable ones) with zero human intervention between detection and PR. MTTR drops from 30-90 minutes (human) to 5-15 minutes (agent + review).

**Safety guardrails**: (1) Fixes must pass the full test suite before reaching human review. (2) All AI fixes are flagged for mandatory human review; never auto-merge without sign-off. (3) The agent is restricted from modifying production configs, database schemas, and infrastructure-as-code. (4) Debugging runs in an isolated container with read-only codebase access; writes go only to a feature branch.
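A skeletal version of the plan/execute/verify loop, in Python. This is a sketch, not a production harness: `propose_fix` is a stub where your model call goes, and the iteration cap and pytest invocation are assumptions.

```python
import pathlib
import subprocess

MAX_ITERATIONS = 3  # assumption: cap retries so a stuck agent fails fast

def run_tests() -> subprocess.CompletedProcess:
    # Verify step: run the full suite, not just the fixed function's test
    return subprocess.run(["pytest", "-q"], capture_output=True, text=True)

def propose_fix(error_output: str, target: pathlib.Path) -> str:
    """Stub: send the failure plus the file to your model, get new contents back."""
    raise NotImplementedError("wire in your local or API model here")

def fix_loop(target: pathlib.Path) -> bool:
    result = run_tests()
    for _ in range(MAX_ITERATIONS):
        if result.returncode == 0:
            return True                # pass -> flag the branch for human review
        new_source = propose_fix(result.stdout + result.stderr, target)
        target.write_text(new_source)  # writes stay on the feature branch
        result = run_tests()           # feed the next failure back to the agent
    return False                       # escalate to a human after repeated failures
```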

What breaks

Failure modes operators see in the wild.

- **Fix that introduces a new bug.** The model correctly fixes the reported bug, but the change breaks a caller in a different file it didn't read. This is the most common and most dangerous failure. Mitigation: run the full test suite, not just the fixed function's test; the human reviewer must specifically check cross-module impact; and limit the agent to specific files rather than letting it roam making "improvements."
- **Hallucinated dependency.** The model suggests importing a library or function that doesn't exist, hallucinating a plausible API from patterns of real libraries. Mitigation: run a linter or build after the agent proposes a fix; an import error is caught in seconds at lint stage vs minutes at test stage (see the validation sketch after this list).
- **Incorrect root cause from an ambiguous stack trace.** The error is at line 47 of auth.py, but the root cause is a null value set at line 120 of config.py earlier in the run. The model proposes a null-check at line 47 (treating the symptom) instead of tracing to the origin. Mitigation: include full error context (variable values, recent logs, execution path); agents should request additional context before proposing fixes.
- **Language version mismatch.** The model suggests Python 3.12 syntax for a 3.8 codebase, so the fix itself raises a SyntaxError. Mitigation: state the language version in the prompt, or have the agent check pyproject.toml/package.json for version constraints.
- **Security degradation in "fixed" code.** The model removes a security check to make a bug go away, e.g. it removes a file size check to "fix" a large-upload bug, and the system now accepts arbitrarily large files. Mitigation: security linter in a pre-commit hook (bandit, eslint-plugin-security); human review must specifically check security-relevant changes; don't allow the agent to modify auth, validation, or sanitization code.
- **Context window overflow, partial fix.** The bug requires understanding 5 files; the agent's context fits only 3. The agent produces a fix based on partial understanding that works for the 3 files it saw but breaks interactions with the 2 it didn't. Mitigation: use models with 128K+ context; with 32K models, the agent should explicitly request additional files when it detects cross-file dependencies.
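Two of these mitigations, catching hallucinated imports and version-mismatch syntax errors before the test stage, reduce to a few lines of stdlib Python. A minimal sketch; run it under the project's own interpreter so `ast.parse` rejects syntax the target Python version doesn't support.

```python
import ast
import importlib.util

def validate_patch(src: str, filename: str = "patch.py") -> list[str]:
    """Cheap pre-test checks for two common agent failures:
    syntax this interpreter rejects, and imports that don't exist."""
    try:
        tree = ast.parse(src, filename=filename)
    except SyntaxError as exc:
        return [f"syntax error (check the target Python version): {exc}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            roots = [node.module.split(".")[0]]
        else:
            continue
        for root in roots:
            if importlib.util.find_spec(root) is None:
                problems.append(f"import '{root}' not installed -- possibly hallucinated")
    return problems
```

Calling `validate_patch(open("fix.py").read())` on a proposed fix returns an empty list when the patch is at least syntactically valid and importable, which is far cheaper than discovering the same failure one test run later.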

Hardware guidance

**Hobbyist ($600-$1,500)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb). Runs [CodeGemma 7B](/models/codegemma-7b) or [Codestral Mamba 7B](/models/codestral-mamba-7b) for stack trace analysis and simple fixes: catches common patterns, misses multi-step logic. A [MacBook Pro 16 M4 Max 36GB](/hardware/macbook-pro-16-m4-max) runs CodeGemma 7B on [MLX LM](/tools/mlx-lm) for portable debugging.

**SMB ($2,000-$4,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). Runs [DeepSeek Coder V3](/models/deepseek-coder-v3) or [Qwen 3 32B](/models/qwen-3-32b) at Q4, the minimum for agentic debugging where the model must understand multi-file code. Expect 40-80 tok/s at Q4 and sub-second TTFT for simple queries. Note that the 5090's 32 GB does not fit [Llama 3.3 70B](/models/llama-3-3-70b) at Q4 (the Q4 weights alone run roughly 40 GB); 70B-class at this tier means ~3-bit quantization or partial CPU offload.

**Enterprise ($8,000-$25,000)**: [RTX A6000](/hardware/rtx-a6000) 48 GB or [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for a team debugging server. Runs [DeepSeek Coder V3](/models/deepseek-coder-v3) Q8 (higher reasoning quality, fewer hallucinated fixes) or, on a multi-GPU build, [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) Q4 for best-in-class open-weight code reasoning. Serves 5-20 developers simultaneously.

**Frontier ($50,000+)**: [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) or [MI300X](/hardware/amd-mi300x) for [DeepSeek V4](/models/deepseek-v4) agentic debugging. Frontier reasoning handles multi-file root cause analysis better than 70B-class. ROI makes sense only when developer time is $150-300/hour and MTTR reduction saves thousands of hours annually.

**Speed vs quality tradeoff**: Interactive debugging needs TTFT under 1 second: use a fast GPU ([RTX 4090](/hardware/rtx-4090)). Batch debugging (nightly crash analysis) prioritizes quality: use a larger GPU ([L40S](/hardware/nvidia-l40s)). Agentic debugging is token-intensive, 20K-100K tokens per session; budget around 50K tokens on average.

Runtime guidance

**If debugging interactively in an IDE** → [Cline](/tools/cline) in VS Code. It reads the codebase, runs terminal commands, applies edits, and iterates. Use the [Claude 3.7 Sonnet](https://claude.ai) API for the best debugging quality (15-30 points above open-weight on SWE-bench). For local: [vLLM](/tools/vllm) serving [DeepSeek Coder V3](/models/deepseek-coder-v3) or [Qwen 3 32B](/models/qwen-3-32b) as the Cline backend.

**If preferring CLI-based debugging** → [Aider](/tools/aider) with `--architect` mode. Two-model architecture: a reasoning model plans ([DeepSeek V4](/models/deepseek-v4) via API or local) and a code model edits ([DeepSeek Coder V3](/models/deepseek-coder-v3) local). Aider auto-commits each change to git for easy reversion. Run: `aider --architect --model openrouter/deepseek/deepseek-v4 --editor-model ollama/deepseek-coder-v3`.

**If needing offline, privacy-safe debugging** → [LM Studio](/tools/lm-studio) or [Ollama](/tools/ollama) for one-off sessions: copy the error, stack trace, and code files; paste; receive an explanation and fix suggestion. The manual (non-agentic) approach guarantees zero data leaves the machine. Best local picks: [DeepSeek Coder V3](/models/deepseek-coder-v3) for code bugs, [Llama 3.3 70B](/models/llama-3-3-70b) for architectural bugs.

**If building an automated debugging pipeline** → Wrap [Aider](/tools/aider) or [Cline](/tools/cline) in CI/CD. On a test failure in CI, trigger the agent with the failing test output plus the changed-files diff; the agent attempts a fix, commits to a new branch, and creates a PR for review. This catches 20-40% of CI failures automatically (a hook sketch follows below).

**Model ranking (SWE-bench verified)**:

1. Claude 3.7 Sonnet (API) — 50-65%
2. GPT-5 (API) — 55-70%, higher cost
3. [DeepSeek V4](/models/deepseek-v4) (API/self-hosted) — 40-55%
4. [Llama 3.3 70B](/models/llama-3-3-70b) (local) — 25-35%
5. [Qwen 3 32B](/models/qwen-3-32b) (local) — 20-30%
6. [DeepSeek Coder V3](/models/deepseek-coder-v3) (local) — 20-30%

**Caveat**: SWE-bench scores are for fully autonomous debugging with no human intervention. Human-in-the-loop (agent proposes, human approves) achieves a 70-85% fix rate on the same bugs. Pure autonomous is for low-risk, high-volume bugs; human-in-the-loop is for all production bugs.
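A hedged sketch of that CI hook: on a test failure, hand the failing output and the recent diff to Aider on a throwaway branch. The branch name is arbitrary, and the `--yes`/`--message` flags match recent Aider releases; verify against `aider --help` before wiring this into a pipeline.

```python
import subprocess

def ci_debug_hook(branch: str = "ai-fix") -> None:
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if tests.returncode == 0:
        return  # nothing to fix
    diff = subprocess.run(
        ["git", "diff", "HEAD~1"], capture_output=True, text=True
    ).stdout
    subprocess.run(["git", "checkout", "-b", branch])  # never touch main directly
    subprocess.run([
        "aider", "--yes", "--message",
        f"These tests fail:\n{tests.stdout}\n\nRecent changes:\n{diff}\n"
        "Find and fix the bug. Do not modify the tests.",
    ])
    # Aider commits its own edits; CI then opens a PR from the branch for review.
```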

Setup walkthrough

  1. Install [Ollama](/tools/ollama), then pull a code model: `ollama pull qwen2.5-coder:14b` (~9 GB).
  2. Copy your error message + relevant code snippet (the function throwing the error + its callers).
  3. Prompt: "Here is my Python code: [paste code]. I get this error: [paste traceback]. What's wrong and how do I fix it?" (A scripted version of this step follows the list.)
  4. First diagnostic in 5-15 seconds. The model reads the traceback, identifies the bug, and suggests a fix with corrected code.
  5. For systematic debugging (VS Code): install the Continue extension → select the buggy function → Ctrl+Shift+P → "Continue: Debug Selection" → model analyzes the function for logic errors, off-by-ones, type mismatches.
  6. For runtime debugging: pair with pdb/Chrome DevTools — the model suggests what to inspect at each breakpoint.
  7. Best models for debugging: Qwen 2.5 Coder 32B > DeepSeek Coder V3 > Qwen 2.5 Coder 14B > Codestral Mamba 7B.
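Steps 2-4 collapse into a short script against Ollama's local HTTP API (default port 11434). A sketch: `bug.py` and `traceback.txt` stand in for your own code and saved error.

```python
import requests

code = open("bug.py").read()
trace = open("traceback.txt").read()

resp = requests.post(
    "http://localhost:11434/api/generate",   # Ollama's default endpoint
    json={
        "model": "qwen2.5-coder:14b",
        "prompt": (
            f"Here is my Python code:\n{code}\n\n"
            f"I get this error:\n{trace}\n\n"
            "What's wrong and how do I fix it?"
        ),
        "stream": False,                     # one JSON response instead of a stream
    },
    timeout=300,
)
print(resp.json()["response"])
```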

The cheap setup

Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Qwen 2.5 Coder 14B at 25-35 tok/s — handles most single-function debugging (logic errors, type errors, off-by-one). Pair with Ryzen 5 5600 + 16 GB DDR4 + 512 GB NVMe. Total: ~$360-405. For debugging across multiple files (trace the error through the call chain), the 14B works but needs context management — paste the relevant functions manually. The 7B coding models can debug simple syntax/type errors but fail at logical bugs.

The serious setup

Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs Qwen 2.5 Coder 32B at 35-50 tok/s or DeepSeek Coder V3 at 15-20 tok/s — these models debug multi-file issues, understand async/concurrency bugs, and suggest architectural fixes. Pair with aider or Cline for full-repo debugging: the agent reads the traceback, finds the relevant files, proposes fixes, and runs tests. Total: ~$1,800-2,200. Debugging quality jumps sharply at 32B — the model can reason about "why" not just "what."

Common beginner mistake

The mistake: Pasting an entire 500-line file + a traceback into a 7B model and asking "fix this." Why it fails: 7B models get lost in long contexts — they fixate on the wrong part of the code, suggest changes to lines that aren't broken, or miss the root cause because it's buried 200 lines into the file. The fix: Narrow the context. Paste only: (1) the traceback, (2) the function where the error occurs, (3) the function that called it, (4) any relevant type definitions or imports. ~50-100 lines total. Also: ask for a diagnosis BEFORE asking for a fix — "What is causing this error?" gets better answers than "Fix this" because it forces the model to reason before acting.
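The narrowing itself can be semi-automated. A rough sketch that pulls only the frames named in a textual traceback plus ~20 lines around each, instead of whole files; the regex targets CPython's standard traceback format.

```python
import re
from pathlib import Path

FRAME_RE = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+)')

def minimal_context(traceback_text: str, radius: int = 20) -> str:
    chunks = [traceback_text]
    for m in FRAME_RE.finditer(traceback_text):
        path, lineno = Path(m["file"]), int(m["line"])
        if not path.exists():  # skip frames whose source isn't on this disk
            continue
        lines = path.read_text().splitlines()
        lo, hi = max(0, lineno - 1 - radius), min(len(lines), lineno + radius)
        snippet = "\n".join(lines[lo:hi])
        chunks.append(f"# {path}:{lo + 1}-{hi}\n{snippet}")
    return "\n\n".join(chunks)
```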

Reality check

Code models are LLM workloads, so the same VRAM math applies. 16 GB runs 13-32B models at Q4 (Qwen 2.5 Coder, DeepSeek Coder); 24 GB runs 32B-class comfortably, while 70B-class at Q4 needs roughly 40 GB of weights and therefore a 48 GB card or a multi-GPU split. The killer detail is the context window: multi-file debugging wants 32K+ tokens, and on a 70B that adds roughly 10 GB of fp16 KV cache on top of the weights.
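The back-of-envelope math behind that KV cache figure, using Llama 3 70B's published shape (80 layers, 8 KV heads under GQA, head dimension 128) and an fp16 cache:

```python
# Llama 3 70B shape: 80 layers, 8 KV heads (GQA), head dim 128, fp16 cache
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V planes
print(per_token)  # 327680 bytes, about 0.31 MB per token of context
for ctx in (8_192, 32_768, 65_536):
    print(f"{ctx} tokens -> {per_token * ctx / 2**30:.1f} GiB of KV cache")
# 8K -> 2.5 GiB, 32K -> 10.0 GiB, 64K -> 20.0 GiB, all on top of the weights
```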

Common mistakes

  • Skipping the context-window math (KV cache eats VRAM at scale; see the calculation above)
  • Using general-purpose instruct models for code (specialized code models are 30-50% better)
  • Running coding agent loops on 8 GB (a single 7B query works, but agent loops compound context until it overflows)
  • Forgetting flash attention (long-context code workflows benefit from it far more than chat)

What breaks first

The errors most operators hit when running debugging locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle debugging before committing money.
