Claude Code with local models
Anthropic's Claude Code is a cloud-first terminal agent — but in 2026 there are two clean paths to point it at a local backend instead: Ollama's native Anthropic-compatible Messages endpoint (shipped Jan 16, 2026) and the LiteLLM proxy translation layer. Here's how to set up each, which local models hold up, and when the natively local alternatives (Aider, Cline) are the cleaner path.
TL;DR
Two verified paths in May 2026:
- Path A — direct. Ollama added Anthropic Messages API compatibility on Jan 16, 2026. Set ANTHROPIC_BASE_URL=http://localhost:11434 and ANTHROPIC_AUTH_TOKEN=ollama, and Claude Code talks to Ollama as if Ollama were Anthropic. No proxy needed.
- Path B — LiteLLM proxy. For backends that don't speak the Anthropic wire protocol (vLLM, llama.cpp server, LM Studio), put LiteLLM in front. LiteLLM accepts Anthropic-format requests and routes them to whatever downstream you configure.
Important tradeoff: Claude Code is heavily tuned for Claude. Local 32B coders handle simple edits well; complex multi-step planning behaves notably worse than the cloud default. The natively local alternatives (Aider, Cline) often fit the local-hardware envelope better.
Editorial stance
RunLocalAI is brand-agnostic. We don't earn referral fees from Anthropic, Ollama, LiteLLM, or any tool covered here. This page documents how to point a cloud-default coding agent at a local backend — not a recommendation to keep using Claude Code if a natively local tool fits your situation better. § 9 names the alternatives plainly.
See /how-we-make-money.
What Claude Code is
Claude Code is Anthropic's terminal coding agent — installed via npm install -g @anthropic-ai/claude-code or the official installer. It reads your repo, plans changes, edits files, runs tests, and iterates. Default wire protocol is Anthropic's Messages API; default backend is the Claude family in the cloud.
Why local matters: same reasons as any cloud-default coding tool — privacy, cost, offline, and familiar UX with a different backend. Two surfaces matter for the swap: where Claude Code sends requests (controlled by ANTHROPIC_BASE_URL) and how the responses are shaped (Anthropic's Messages format).
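Concretely, the wire format looks like this — a minimal sketch assuming Ollama's compat layer mirrors the cloud /v1/messages path and that qwen2.5-coder:32b is already pulled (Path A below sets this up properly):

# A minimal Messages-format request — the wire shape Claude Code emits.
curl -s http://localhost:11434/v1/messages \
  -H "x-api-key: ollama" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
# A conforming backend answers in Anthropic's response shape:
# {"role": "assistant", "content": [{"type": "text", "text": "..."}], ...}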
Two paths to local
The two paths differ by where the wire-protocol translation happens:
- Path A — backend speaks Anthropic natively. Since Jan 2026, Ollama exposes an Anthropic-compatible Messages endpoint on its standard port. No proxy; fewest moving parts.
- Path B — proxy translates. LiteLLM accepts Anthropic-format requests on one side, calls any model (OpenAI, local-OpenAI-compat, etc.) on the other. Use when you want vLLM / llama.cpp / LM Studio behind Claude Code.
Pick A when Ollama is your runtime. Pick B for anything else.
Path A — Ollama Anthropic-compatible endpoint
# 1. Ollama 0.5+ (Anthropic-compat shipped Jan 16, 2026)
ollama serve &
ollama pull qwen2.5-coder:32b
# 2. Point Claude Code at local instead of api.anthropic.com
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"   # any non-empty value
export ANTHROPIC_API_KEY=""            # explicitly empty so the SDK doesn't preempt
# 3. (Optional) pin the model Claude Code asks for
export ANTHROPIC_MODEL="qwen2.5-coder:32b"
# 4. Run normally
cd ~/your-repo
claude
That's the whole setup. Claude Code thinks it's calling Anthropic; Ollama answers in the matching wire format. The biggest gotcha is the empty ANTHROPIC_API_KEY — if your shell still exports a real key from previous Claude Code use, the SDK lets it preempt the base URL.
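Before launching, it's worth confirming what the shell actually exports — a quick check for the stale-key gotcha:

# The stale-key gotcha in one check: API_KEY must be empty or unset,
# BASE_URL must point at localhost.
env | grep '^ANTHROPIC_'
export ANTHROPIC_API_KEY=""   # re-pin empty if a real key leaked in from a shell profile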
Path B — LiteLLM proxy in front
For vLLM / llama.cpp-server / LM Studio backends, run LiteLLM as a translation layer:
# 1. Install LiteLLM
pip install 'litellm[proxy]'
# 2. Write litellm-config.yaml — example for vLLM backend
cat > litellm-config.yaml <<'YAML'
model_list:
  - model_name: claude-3-5-sonnet-20241022        # what Claude Code asks for
    litellm_params:
      model: openai/qwen2.5-coder-32b-instruct    # what LiteLLM forwards
      api_base: http://localhost:8000/v1          # your vLLM endpoint
      api_key: sk-local
YAML
# 3. Start the proxy
litellm --config litellm-config.yaml --port 4000
# 4. Point Claude Code at LiteLLM
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="sk-anything"   # any value, or your LiteLLM master key if one is set
# 5. Run
claude

The same shape works for llama.cpp's OpenAI-compatible server (./server -m model.gguf) and LM Studio (built-in OpenAI compat). Adjust api_base and model in the config to match.
Claude Code asks for specific Anthropic model names (claude-3-5-sonnet-20241022, etc.). LiteLLM's model_name field is the alias Claude Code sees; litellm_params.model is what actually gets called. Map every Anthropic model name your Claude Code version requests, or use model_group_alias as a catch-all.
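A sketch of that mapping, extending the config above. The haiku and opus names are examples of aliases a Claude Code release might request, and the model_group_alias block follows LiteLLM's router_settings convention — verify both against your Claude Code version and the LiteLLM docs.

cat > litellm-config.yaml <<'YAML'
model_list:
  - model_name: claude-3-5-sonnet-20241022      # alias Claude Code requests
    litellm_params:
      model: openai/qwen2.5-coder-32b-instruct
      api_base: http://localhost:8000/v1        # vLLM; 8080 for llama.cpp server, 1234 for LM Studio
      api_key: sk-local
  - model_name: claude-3-5-haiku-20241022       # map the small-model alias too
    litellm_params:
      model: openai/qwen2.5-coder-32b-instruct
      api_base: http://localhost:8000/v1
      api_key: sk-local
router_settings:
  model_group_alias:                            # catch-all for other claude-* names
    claude-3-opus-20240229: claude-3-5-sonnet-20241022
YAML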
Which local models fit
The Claude Code prompt + tool-call schema is tuned for Claude-family behavior. 2026 picks that handle it reliably:
Note on the Hermes line in the table below: “specifically tuned for tool-use loops” reflects Nous Research's published function-calling fine-tunes plus observed uptake across community recipes on r/LocalLLaMA and Ollama threads. We don't hold an audited usage count; treat the framing as a popular default rather than a measured winner.
| Model | VRAM | Claude-Code fit |
|---|---|---|
| Qwen 2.5 Coder 32B | 24GB | Best 32B-class coder; clean tool-call schema |
| Hermes 3 8B | 12GB | Specifically tuned for tool-use loops — most robust small-class pick when tool-call reliability matters more than reasoning depth |
| Hermes 4 70B | 48GB+ | Strongest local pick for agentic loops when you have the headroom — tool-use is the differentiator |
| Llama 3.3 70B Instruct | 48GB+ | Solid all-rounder; weaker than dedicated coders on diff patterns |
| DeepSeek Coder V2 16B | 16GB | Best 16GB option; less reliable on multi-step plans |
| Qwen 3 Coder 32B | 24GB | Newer Qwen-line coder; pick whichever Ollama serves cleanest |
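To stage the table's picks on an Ollama backend, pull them ahead of time. The tags below are common Ollama library names, but exact tags shift between releases — verify against the model pages before pulling.

ollama pull qwen2.5-coder:32b        # 24GB-class coder pick
ollama pull hermes3:8b               # small tool-use pick
ollama pull llama3.3:70b             # all-rounder, 48GB+
ollama pull deepseek-coder-v2:16b    # 16GB option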
Models without strong tool-use training (older base models, generic Llamas without Hermes-style finetuning) will hallucinate JSON tool-call structure. Claude Code will then surface confused error states. Stick to the models above.
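If you want to check a candidate model before wiring it into Claude Code, send it one tool definition in Anthropic's tools format and look at what comes back — this assumes your backend (Ollama's compat endpoint here) accepts the tools field:

# Probe tool-call reliability with a single hypothetical read_file tool.
curl -s "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "x-api-key: $ANTHROPIC_AUTH_TOKEN" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "max_tokens": 256,
    "tools": [{
      "name": "read_file",
      "description": "Read a file from the working tree",
      "input_schema": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}
    }],
    "messages": [{"role": "user", "content": "Open README.md"}]
  }'
# Tool-capable models answer with a {"type": "tool_use", ...} content block
# and well-formed "input" JSON; weak models emit the call as plain text.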
Limits + when to keep using cloud
- Frontier reasoning gap. Claude Sonnet 4+ on a complex refactor still outperforms any local 32B coder. If your daily work is cross-codebase architectural moves, local backends will frustrate you.
- 200K+ context. Claude's 200K cloud context is far beyond local VRAM ceilings; around 32K is the realistic local limit, and longer Claude Code prompts will truncate.
- Tool-call drift between releases. Claude Code updates its prompts and tool schemas regularly. A new release can break a working local config until the model catches up (or you update Ollama / LiteLLM mappings) — see the pinning sketch after this list.
- Subscription pricing. Cloud Claude Code on a paid Claude.ai subscription has generous limits at fixed cost. Local trades that for hardware capex + electricity — see /cost-calculator for the math.
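One way to contain that drift is to pin the Claude Code release your local config was validated against and upgrade deliberately. The version below is a placeholder, not a known-good release:

npm view @anthropic-ai/claude-code versions        # see what's published
npm install -g @anthropic-ai/claude-code@1.0.24    # placeholder — substitute your validated version
npm ls -g @anthropic-ai/claude-code                # confirm the pin took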
Keep using cloud when: you need frontier reasoning, you regularly work with 100K+ contexts, your team policy allows cloud, or your monthly Claude Code spend is below what comparable local hardware would cost to amortize.
Natively-local alternatives
Tools built local-first usually involve less friction than a cloud tool routed through a translation layer:
- Aider — terminal-driven, git-aware, OpenAI-compatible from day one. The lowest-friction path on local hardware.
- Cline — VS Code-integrated. Multi-mode personas; project-level rules; great with Hermes/Qwen Coder.
- Roo Code — faster-moving Cline fork. Architect / Code / Ask modes pre-built.
- Continue — VS Code / JetBrains extension; lighter than Cline, good for autocomplete-heavy workflows.
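For comparison, the same Ollama backend behind Aider takes a few lines — a sketch assuming Aider's ollama/ model prefix and OLLAMA_API_BASE convention; check the Aider docs for your version:

export OLLAMA_API_BASE=http://localhost:11434
cd ~/your-repo
aider --model ollama/qwen2.5-coder:32b   # git-aware edits straight against the local backend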
Related guides
- Same pattern, OpenAI's side — routing via OPENAI_API_BASE.
- Same pattern, Google's side — via GOOGLE_GEMINI_BASE_URL.
- Get an Ollama backend running first, then point any CLI at it.
- Every local coding tool we track, and when each fits.