
Repository Chat

Conversational interface over an entire codebase. Combines RAG over code + long-context model + tool-use for navigation.

Capability notes

Repo-chat tools let you ask natural-language questions about a codebase — "where is authentication logic," "how does the payment flow work" — receiving answers grounded in your actual code via RAG over the repository. Quality depends on three factors: **indexing** (does retrieval find all relevant code), **context assembly** (does the assembled context contain enough to answer), and **model capability** (can the LLM reason about retrieved code). **What repo-chat does well**: (1) Locate functionality — "where is the rate limiter" returns exact file/function. (2) Explain code — "explain the caching layer" line-by-line. (3) Trace execution paths — "trace request lifecycle from router to database." (4) How-do-I questions — "add a new API endpoint" based on existing patterns. (5) Type-aware code generation using project's actual types. **What repo-chat cannot do**: (1) Understand unwritten conventions — "never import directly from utils/." (2) Navigate highly abstracted code — deep inheritance chains, metaprogramming, dependency injection are invisible to static analysis. (3) Assess performance — "is this query efficient" can't discern query plans. (4) Understand intent — "why was this written this way" requires git history and design docs. **Type-aware retrieval** is the key differentiator. [Continue.dev](/tools/continue) uses tree-sitter for AST parsing (function boundaries, class hierarchies, import graphs). [Cursor](/tools/cursor) uses proprietary code-optimized embeddings capturing semantic similarity better than general-purpose embeddings ([BGE-M3](/models/bge-m3) does code retrieval but isn't optimized for it). Sourcegraph Cody uses SCIP for precise cross-repository symbol resolution. **Answer quality vs manual reading**: For straightforward questions ("where is this defined"), repo-chat is faster (seconds vs minutes). For complex questions ("is there a race condition here"), manual reading remains more reliable — the LLM misses cross-file interactions and non-local effects that human experience catches. Repo-chat is a search/explanation accelerator, not a replacement for understanding.
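
To make the indexing step concrete, here is a minimal sketch of function-boundary chunking. It uses Python's stdlib `ast` module as a simplified stand-in for the tree-sitter parsing these tools actually use; the file path in the example is illustrative.

```python
# Minimal sketch: chunk a Python file at function/class boundaries instead of
# fixed-size windows, keeping the file's imports as retrievable metadata.
# Uses the stdlib ast module as a simplified stand-in for tree-sitter, which
# generalizes the same idea across languages.
import ast
from pathlib import Path

def chunk_python_file(path: str) -> list[dict]:
    source = Path(path).read_text()
    tree = ast.parse(source)
    lines = source.splitlines()
    imports, chunks = [], []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file": path,
                "symbol": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
            })
    # Attach the file's import list to every chunk so retrieval can follow
    # cross-file references later.
    for c in chunks:
        c["imports"] = imports
    return chunks

if __name__ == "__main__":
    for chunk in chunk_python_file("app/auth.py"):  # hypothetical path
        print(chunk["symbol"], chunk["start_line"], chunk["end_line"])
```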

If you just want to try this

Lowest-friction path to a working setup.

Install [Continue.dev](/tools/continue) as a VS Code or JetBrains extension. Free, open-source, 15-minute setup. After installation, open your project. Continue indexes your codebase in the background — 2-5 minutes for a 50,000-line project on a modern laptop. Once indexing completes, open Continue chat (Ctrl+L / Cmd+L) and type `@codebase` followed by your question. Continue uses [BGE-M3](/models/bge-m3) embeddings by default for code indexing, combined with tree-sitter AST parsing. The answering model is your chosen LLM — configure via [Ollama](/tools/ollama), [LM Studio](/tools/lm-studio), or any API. For the best local setup: install [Ollama](/tools/ollama), pull [DeepSeek Coder V3](/models/deepseek-coder-v3) (`ollama pull deepseek-coder-v3`) or [CodeGemma 7B](/models/codegemma-7b) if you have less VRAM. Set the model to `ollama/deepseek-coder-v3` in the Continue config. A local model handles code explanation and simple generation well; complex multi-file reasoning benefits from API models (Claude). Expect accurate answers on the first few lookup-style questions, with diminishing returns on architectural ones — repo-chat is a force multiplier for navigation, not an oracle. Use it to find things faster; read code yourself to understand deeply. For privacy-sensitive codebases: Continue.dev + [Ollama](/tools/ollama) with [DeepSeek Coder V3](/models/deepseek-coder-v3) or [Qwen 3 32B](/models/qwen-3-32b) is fully local. No code leaves your machine. For the strongest local answers, [Llama 3.3 70B](/models/llama-3-3-70b) on an [RTX 4090](/hardware/rtx-4090) outperforms 32B code models on architectural reasoning.
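
To confirm the fully local answering path works end to end, you can hit Ollama's HTTP API directly, which is the same local endpoint Continue's Ollama provider talks to. A minimal sketch, assuming you pulled `deepseek-coder-v3` as above (substitute whatever model you chose); the code snippet stands in for retrieved context.

```python
# Quick sanity check of the local answering path: send a code snippet plus a
# question to Ollama's local HTTP API. In a real repo-chat setup the snippet
# comes from the retriever; nothing leaves the machine.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "deepseek-coder-v3"  # the model you pulled above; substitute yours

code_snippet = '''
def rate_limit(key: str, limit: int = 100) -> bool:
    ...
'''  # stand-in for a retrieved chunk

resp = requests.post(OLLAMA_URL, json={
    "model": MODEL,
    "stream": False,
    "messages": [
        {"role": "system", "content": "Answer questions about the provided code only."},
        {"role": "user", "content": f"Code:\n{code_snippet}\n\nQuestion: what does rate_limit do?"},
    ],
}, timeout=120)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```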

For production deployment

Operator-grade recommendation.

Production codebase AI requires indexing freshness, permission scoping, and security — problems absent in single-developer use. **On-premises indexing**: For proprietary code that cannot leave the network, deploy self-hosted pipeline. [Continue.dev](/tools/continue) supports local-only with [Ollama](/tools/ollama) backend and local [BGE-M3](/models/bge-m3) embeddings. Index lives on local disk — no cloud sync. For 10-100 developers: shared indexing server (one machine continuously indexing via git webhook, incremental re-index of changed files only), developers connect Continue instances to shared server. Architecture: shared index → IDE instances query only. **Indexing freshness**: Active repos change every few hours. Stale indexes reference deleted functions and suggest refactored-away patterns. Implement continuous re-index: git webhook on push → incremental re-index (changed files only, 1-5 seconds per commit) → update vector store → notify IDE instances. Full re-index for 100K-file monorepo: 10-30 minutes. Incremental for single commit: 1-5 seconds. **Security implications**: Indexing proprietary code creates a searchable database of your codebase — hardcoded secrets, vulnerability patterns, business logic. Mitigations: (1) Secrets scanning before indexing (truffleHog). (2) Access control on index server (same as source code repo). (3) Audit logging on all queries. (4) Encryption at rest for vector DB. (5) Never send proprietary code to third-party LLM APIs. **Permission-aware retrieval**: Different team members see different code (contractors: UI only, backend engineers: backend only). Tag code chunks with file path and access group; filter results per-user. [pgvector](/tools/pgvector) with row-level security provides straightforward implementation. **Model selection**: [DeepSeek Coder V3](/models/deepseek-coder-v3) (~33B active MoE) — best code-specialized open-weight, strongest on code explanation. [Qwen 3 32B](/models/qwen-3-32b) — best balance of code + general reasoning. [Llama 3.3 70B](/models/llama-3-3-70b) — strongest architectural reasoning, weakest specific code knowledge. Pair code model for "what does this do" with reasoning model for "why is this designed this way."
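
A minimal sketch of the webhook-driven incremental re-index described above, assuming a GitHub-style push payload with `before`/`after` commit SHAs; `embed_and_upsert` and `delete_from_index` are hypothetical stand-ins for your chunking, embedding, and vector-store code, and the repo checkout path is illustrative.

```python
# Incremental re-index loop: a webhook receives a push event, diffs the two
# commits, and re-embeds only the changed files. Full re-index never runs on
# a push; deleted files are dropped from the index.
import subprocess
from flask import Flask, request

app = Flask(__name__)
REPO_DIR = "/srv/index/repo"  # hypothetical checkout the indexer keeps updated

def embed_and_upsert(path: str) -> None:
    ...  # re-chunk, re-embed, upsert into the vector store (your pipeline here)

def delete_from_index(path: str) -> None:
    ...  # remove all chunks belonging to a deleted file

def changed_files(before: str, after: str) -> list[str]:
    subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin"], check=True)
    out = subprocess.run(
        ["git", "-C", REPO_DIR, "diff", "--name-status", before, after],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()

@app.post("/webhook/push")
def on_push():
    payload = request.get_json()  # assumes GitHub-style "before"/"after" SHAs
    for line in changed_files(payload["before"], payload["after"]):
        status, path = line.split("\t", 1)
        if status == "D":
            delete_from_index(path)   # drop chunks for deleted files
        else:
            embed_and_upsert(path)    # re-chunk + re-embed the changed file
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(port=8080)
```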

What breaks

Failure modes operators see in the wild.

- **Stale index (references deleted code).** An index from 3 days ago references `CacheManager` that was renamed to `CacheService` yesterday. Answers are syntactically correct but practically useless — they mention non-existent files and suggest recently changed patterns. Mitigation: incremental re-index on every push. For critical deployments: just-in-time per-file indexing before answering adds 1-3 s of latency but guarantees freshness.
- **Cross-file dependency blindness.** Retrieval returns the file where a function is defined but misses the 3 files that import and extend it. The LLM sees the function in isolation — it misses the subclass override, the decorator that changes behavior, and caller-enforced invariants. Mitigation: type-aware retrieval that follows import graphs and class hierarchies. [Continue.dev](/tools/continue)'s tree-sitter tracks basic imports; complex graphs need LSP integration.
- **Type information loss.** Embedding-based retrieval indexes text chunks. The LLM receives the function text but not its type signature context, parent class, or implemented interface, and misinterprets a parameter type because the type definition lives in a different, unretrieved chunk. Mitigation: enrich chunks with type annotations at indexing time from the LSP/type checker (see the sketch after this list).
- **Large-repo performance degradation.** A 500K-file monorepo yields 2-5M code chunks, and retrieval latency grows from <100 ms to 2-10 s. Developers stop using the tool. Mitigation: hierarchical indexing (file-level first, then function-level), ANN indexes with quantization, index partitioning by module.
- **Language-specific parser failures.** Rust macros (token transformations tree-sitter can't parse), C++ templates (instantiation creates code absent from the source), Python decorators (runtime behavior invisible to static analysis). Retrieval sees an incomplete structure. Mitigation: supplement tree-sitter with LSP-based indexing — the LSP understands code as the compiler sees it, after macro expansion and template instantiation.
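
The type-enrichment mitigation can be as simple as prepending signature and parent-class context to each chunk before embedding. A sketch using Python's stdlib `ast` module as a stand-in for a real LSP or type-checker integration; other languages need their own tooling for the same idea.

```python
# "Enrich chunks with type context" mitigation: before embedding a method,
# prepend its file, parent class, base classes, and parameter annotations so
# the retrieved text carries the context the LLM would otherwise miss.
# stdlib ast is used here as a stand-in for an LSP/type-checker integration.
import ast
from pathlib import Path

def enriched_chunks(path: str) -> list[str]:
    source = Path(path).read_text()
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        bases = ", ".join(ast.unparse(b) for b in node.bases) or "object"
        for item in node.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                params = ", ".join(
                    a.arg + (f": {ast.unparse(a.annotation)}" if a.annotation else "")
                    for a in item.args.args
                )
                header = (
                    f"# file: {path}\n"
                    f"# class {node.name}({bases})\n"
                    f"# signature: {item.name}({params})"
                )
                chunks.append(header + "\n" + ast.unparse(item))
    return chunks
```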

Hardware guidance

Repo-chat splits into **indexing** (CPU + RAM + storage bound) and **inference** (GPU bound). These scale independently. **Hobbyist (single dev, <50K lines)**: Any modern laptop 16+ GB RAM with [RTX 3060 12GB](/hardware/rtx-3060-12gb) or better. Indexing: 2-5 minutes. Inference: [CodeGemma 7B](/models/codegemma-7b) or [DeepSeek Coder V3](/models/deepseek-coder-v3) on 12 GB. [MacBook Pro 16 M4 Max 36GB](/hardware/macbook-pro-16-m4-max) handles indexing + inference on one machine — unified memory means no VRAM/RAM split. [Snapdragon X Elite](/hardware/snapdragon-x-elite) 32 GB runs CodeGemma 7B on CPU at 10-20 tok/s. **SMB (small team, 50K-500K lines)**: Dedicated indexing server: 32+ GB RAM, NVMe (2+ GB/s), 8-16 cores. Indexing 500K lines: 10-30 minutes. Inference: [RTX 4090 24GB](/hardware/rtx-4090) runs [Qwen 3 32B](/models/qwen-3-32b) or [DeepSeek Coder V3](/models/deepseek-coder-v3) — 5-10 concurrent queries at <10s latency. **Enterprise (10-100 devs, 500K-5M lines)**: Indexing server: 64-128 GB RAM, NVMe RAID (5+ GB/s), 16-32 cores. Inference: [RTX A6000](/hardware/rtx-a6000) 48 GB or [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB runs [Llama 3.3 70B](/models/llama-3-3-70b) Q4 for strongest reasoning. 2× L40S handles 20-50 concurrent queries. Separate indexing and inference — indexing is bursty (CPU spikes on push), inference is continuous. **Frontier (100+ devs, 5M+ lines)**: Multi-node indexing with [pgvector](/tools/pgvector) sharded by repo/module. Inference: [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) for [DeepSeek V4](/models/deepseek-v4) or [Qwen 3 235B](/models/qwen-3-235b-a22b). Multi-GPU [vLLM](/tools/vllm) for 100+ concurrent queries. **Storage**: a 1024-dim FP32 vector is 4 KB. A 100K-line codebase chunked at function level (~5K-20K chunks) needs only tens of MB of raw vectors; a 10M-line monorepo (hundreds of thousands to a few million chunks) lands in the 2-10 GB range. ANN indexes add 20-50% overhead, and stored chunk text and metadata add more.
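
A back-of-the-envelope sizing helper matching the assumptions above (1024-dim FP32 vectors, roughly 15 lines of code per function-level chunk, ~35% ANN overhead); all inputs are rough knobs, not measured values.

```python
# Rough vector-index sizing. Assumptions: ~15 lines of code per function-level
# chunk, 1024-dim FP32 embeddings (4 KB per vector), 35% ANN overhead.
# Stored chunk text and metadata are not included.
def index_size_gb(lines_of_code: int,
                  lines_per_chunk: int = 15,
                  dim: int = 1024,
                  bytes_per_float: int = 4,
                  ann_overhead: float = 0.35) -> float:
    chunks = lines_of_code / lines_per_chunk
    raw_bytes = chunks * dim * bytes_per_float
    return raw_bytes * (1 + ann_overhead) / 1e9

for loc in (100_000, 500_000, 10_000_000):
    print(f"{loc:>12,} LOC -> ~{index_size_gb(loc):.2f} GB of vectors")
# ~0.04 GB for 100K lines, ~0.18 GB for 500K, ~3.7 GB for 10M lines.
```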

Runtime guidance

**If individual developer wanting IDE codebase Q&A** → [Continue.dev](/tools/continue) as VS Code/JetBrains extension. Most mature open-source IDE code-chat. Configure local via [Ollama](/tools/ollama) for privacy, or API for quality. `@codebase` context provider indexes project and retrieves relevant code. Also supports `@file`, `@folder`, `@docs`, custom providers. **If wanting highest-quality answers from code model** → [Cursor](/tools/cursor) with proprietary code-optimized embeddings + Claude/GPT backend. Industry-leading retrieval quality. VS Code fork (same UX). Tradeoff: codebase indexing sends embeddings to Cursor servers — privacy consideration for proprietary code. **If needing on-premises code AI for team** → [Continue.dev](/tools/continue) + shared indexing server + [Ollama](/tools/ollama) or [vLLM](/tools/vllm) backend. Configure remote embedding server and LLM endpoint. All within your network — no code leaves. **If needing cross-repository intelligence** → [Sourcegraph Cody](https://sourcegraph.com) (hosted or self-hosted). SCIP index provides precise cross-repo code navigation — knows `User` type in repo A is the same as `User` imported by repo B. For organizations with internal package ecosystems. Self-hosted: $0-19/user/month. **If building custom repo-chat** → [LlamaIndex](https://www.llamaindex.ai/) or [LangChain](https://www.langchain.com/) with tree-sitter-based code chunking. [BGE-M3](/models/bge-m3) embeddings. Store in [pgvector](/tools/pgvector) or [Qdrant](/tools/qdrant). [vLLM](/tools/vllm) for answering LLM. 1-2 months engineering for production quality. **Quick comparison**: Continue.dev (free, local, good) for privacy/cost. Cursor ($20/mo, best quality) for answer quality. Sourcegraph Cody (self-hosted option, best multi-repo). Custom (full control, $5K-50K+ engineering) for enterprise compliance. **Model selection**: [DeepSeek Coder V3](/models/deepseek-coder-v3) — best code-specialized. [CodeGemma 7B](/models/codegemma-7b) — best small, fits 12 GB. [Qwen 3 32B](/models/qwen-3-32b) — best code+reasoning balance. [Llama 3.3 70B](/models/llama-3-3-70b) — best architectural reasoning.
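
For the build-it-yourself path, the retrieval core is small. A minimal sketch using sentence-transformers to load BGE-M3 and brute-force cosine search in numpy; a production version swaps the in-memory matrix for [pgvector](/tools/pgvector) or [Qdrant](/tools/qdrant) and sends the top chunks to the answering LLM behind [vLLM](/tools/vllm). The chunk strings are illustrative stand-ins for the AST chunker's output.

```python
# Minimal retrieval core for a custom repo-chat pipeline: embed function-level
# chunks with BGE-M3 via sentence-transformers, then brute-force cosine search
# in numpy. Sketch only; production uses a real vector store.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

chunks = [  # in practice: output of the AST chunker, one entry per function
    "def rate_limit(key, limit=100): ...",
    "class PaymentGateway:\n    def charge(self, amount): ...",
    "def verify_jwt(token): ...",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q              # cosine similarity on normalized vectors
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

print(retrieve("where is authentication handled?"))
```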

Setup walkthrough

  1. Install [Ollama](/tools/ollama), then pull a code model: `ollama pull qwen2.5-coder:14b`.
  2. Install repo-chat tooling — two proven options:

Option A — Aider (full-featured): `pip install aider-chat`, then `cd /path/to/repo` and run `aider --model ollama_chat/qwen2.5-coder:14b`. Aider reads the entire repo map (file tree + symbols) into context. Ask: "Explain how authentication works in this codebase." Aider reads the relevant files and explains.

Option B — RepoPrompt (light): `pip install repoprompt`, then `repochat /path/to/repo` → interactive Q&A over the repo.

  3. For VS Code: install the Continue extension → configure Ollama → use `@codebase` to chat about the entire repo.
  4. First useful answer in 5-15 seconds. Quality depends on repo size — under 50K lines works great; over 500K lines needs chunking.

The cheap setup

A used [RTX 3060 12 GB](/hardware/rtx-3060-12gb) (~$200-250). Runs Qwen 2.5 Coder 14B Q4_K_M at 25-35 tok/s — handles repos up to 100K lines with aider's repo map. Embedding the repo (for RAG-style retrieval) takes 1-5 minutes. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 1 TB NVMe. Total: ~$400-480. For repos >200K lines, the 14B model's context window (32K-128K tokens) becomes the bottleneck, not VRAM.

The serious setup

A used [RTX 3090 24 GB](/hardware/rtx-3090) (~$700-900). Runs Qwen 2.5 Coder 32B at Q4_K_M (a Q6 quant does not fit in 24 GB) at 35-50 tok/s — handles repos up to 500K lines with full context. DeepSeek Coder V3 runs at 15-20 tok/s for the strongest code understanding. For monorepos (1M+ lines): pair with a dedicated embedding model + vector DB (ChromaDB, LanceDB) to index the repo once, then retrieve relevant files per query (see the sketch below). Total: ~$1,800-2,200. Dual RTX 3090s run DeepSeek Coder V3 faster and handle larger context.
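
The index-once, retrieve-per-query pattern is a few lines with ChromaDB; the collection name, repo path, and Python-only glob below are illustrative, and ChromaDB falls back to its default embedding function unless you plug in your own.

```python
# Index-once / retrieve-per-query with ChromaDB. The index persists to disk,
# so the repo is only embedded once; paths and collection name are illustrative.
from pathlib import Path
import chromadb

client = chromadb.PersistentClient(path="./repo_index")
collection = client.get_or_create_collection("repo")

repo = Path("/path/to/repo")            # substitute your checkout
files = list(repo.rglob("*.py"))        # illustrative: Python files only
collection.add(
    ids=[str(p) for p in files],
    documents=[p.read_text(errors="ignore") for p in files],
    metadatas=[{"path": str(p)} for p in files],
)

hits = collection.query(query_texts=["where is the payment flow implemented?"],
                        n_results=5)
for path in hits["ids"][0]:
    print(path)
```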

Common beginner mistake

The mistake: Opening a repo in aider/Continue with a 7B model, asking "How does the payment system work?" and getting a hallucinated answer that references files that don't exist. Why it fails: 7B models lack the reasoning depth to navigate complex file relationships. They see the repo map (file list + symbols) but can't infer data flow or architectural patterns from it. The fix: Minimum viable for repo-chat is 14B. At 14B, models start understanding file relationships and can trace call chains across files. For production repo-chat, use 32B+ coding models. Also: give the model specific prompts — "Trace the authentication flow starting from login.ts" works better than "How does auth work?" because it provides a starting point the model can follow.

Reality check

Code models are LLM workloads — the same VRAM math applies. 16 GB runs 13-14B at Q4 comfortably (Qwen 2.5 Coder, DeepSeek Coder); 24 GB unlocks 32B at Q4, and 70B-class models only with aggressive quantization. The killer detail is the context window — code review wants 32K+ tokens, and at those lengths the KV cache alone adds 10-20+ GB on a 70B model.
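
The context-window math in concrete numbers, assuming Llama-3-style 70B GQA dimensions (80 layers, 8 KV heads, head dim 128) and an FP16 KV cache; quantized caches or different architectures shift the totals.

```python
# KV-cache size per token is 2 (K and V) * layers * kv_heads * head_dim *
# bytes_per_element. Defaults assume a Llama-3-style 70B with GQA and an
# FP16 cache; other architectures or quantized caches differ.
def kv_cache_gb(seq_len: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):.1f} GB of KV cache")
# ~2.7 GB at 8K, ~10.7 GB at 32K, ~43 GB at 128K — on top of the model weights.
```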

Common mistakes

  • Skipping context-window math (KV cache eats VRAM at scale)
  • Using general-purpose instruct models for code (specialized code models are 30-50% better on code tasks)
  • Running coding agent loops on 8 GB (a 7B model fits, but agent loops compound context until the KV cache overflows)
  • Forgetting flash attention (long-context code workflows benefit from it far more than short chat sessions)

What breaks first

The errors most operators hit when running repository chat locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle repository chat before committing money.
