Claude Code with local models
Anthropic's Claude Code is a cloud-first terminal agent — but in 2026 there are two clean paths to point it at a local backend instead: Ollama's native Anthropic-compatible Messages endpoint (shipped Jan 16, 2026) and the LiteLLM proxy translation layer. Here's how to set up each, which local models hold up, and when the natively local alternatives (Aider, Cline) are the cleaner path.
TL;DR
Two verified paths in May 2026:
- Path A — direct. Ollama added Anthropic Messages API compatibility on Jan 16, 2026. Set ANTHROPIC_BASE_URL=http://localhost:11434 and ANTHROPIC_AUTH_TOKEN=ollama, and Claude Code talks to Ollama as if Ollama were Anthropic. No proxy needed.
- Path B — LiteLLM proxy. For backends that don't speak the Anthropic wire protocol (vLLM, llama.cpp server, LM Studio), put LiteLLM in front. LiteLLM accepts Anthropic-format requests and routes them to whatever downstream you configure.
Important tradeoff: Claude Code is heavily tuned for Claude. Local 32B coders handle simple edits well; complex multi-step planning behaves notably worse than the cloud default. The natively local alternatives (Aider, Cline) often fit the local-hardware envelope better.
Editorial stance
RunLocalAI is brand-agnostic. We don't earn referral fees from Anthropic, Ollama, LiteLLM, or any tool covered here. This page documents how to point a cloud-default coding agent at a local backend — not a recommendation to keep using Claude Code if a natively local tool fits your situation better. § 9 names the alternatives plainly.
See /how-we-make-money.
What Claude Code is
Claude Code is Anthropic's terminal coding agent — installed via npm install -g @anthropic-ai/claude-code or the official installer. It reads your repo, plans changes, edits files, runs tests, and iterates. Default wire protocol is Anthropic's Messages API; default backend is the Claude family in the cloud.
Why local matters: same reasons as any cloud-default coding tool — privacy, cost, offline, and familiar UX with a different backend. Two surfaces matter for the swap: where Claude Code sends requests (controlled by ANTHROPIC_BASE_URL) and how the responses are shaped (Anthropic's Messages format).
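Concretely, the wire format looks like this — a minimal sketch assuming Ollama's compat layer mirrors the cloud /v1/messages path and that qwen2.5-coder:32b is already pulled (Path A below sets this up properly):

# A minimal Messages-format request — the wire shape Claude Code emits.
curl -s http://localhost:11434/v1/messages \
  -H "x-api-key: ollama" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
# A conforming backend answers in Anthropic's response shape:
# {"role": "assistant", "content": [{"type": "text", "text": "..."}], ...}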
Two paths to local
The two paths differ by where the wire-protocol translation happens:
- Path A — backend speaks Anthropic natively. Since Jan 2026, Ollama exposes an Anthropic-compatible Messages endpoint on its standard port. No proxy; fewest moving parts.
- Path B — proxy translates. LiteLLM accepts Anthropic-format requests on one side, calls any model (OpenAI, local-OpenAI-compat, etc.) on the other. Use when you want vLLM / llama.cpp / LM Studio behind Claude Code.
Pick A when Ollama is your runtime. Pick B for anything else.
Path A — Ollama Anthropic-compatible endpoint
# 1. Ollama 0.5+ (Anthropic-compat shipped Jan 16, 2026)
ollama serve &
ollama pull qwen2.5-coder:32b
# 2. Point Claude Code at local instead of api.anthropic.com
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"   # any non-empty value
export ANTHROPIC_API_KEY=""            # explicitly empty so the SDK doesn't preempt
# 3. (Optional) pin the model Claude Code asks for
export ANTHROPIC_MODEL="qwen2.5-coder:32b"
# 4. Run normally
cd ~/your-repo
claude
That's the whole setup. Claude Code thinks it's calling Anthropic; Ollama answers in the matching wire format. The biggest gotcha is the empty ANTHROPIC_API_KEY — if your shell still exports a real key from previous Claude Code use, the SDK lets it preempt the base URL.
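Before launching, it's worth confirming what the shell actually exports — a quick check for the stale-key gotcha:

# The stale-key gotcha in one check: API_KEY must be empty or unset,
# BASE_URL must point at localhost.
env | grep '^ANTHROPIC_'
export ANTHROPIC_API_KEY=""   # re-pin empty if a real key leaked in from a shell profile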
Path B — LiteLLM proxy in front
For vLLM / llama.cpp-server / LM Studio backends, run LiteLLM as a translation layer:
# 1. Install LiteLLM
pip install 'litellm[proxy]'
# 2. Write litellm-config.yaml — example for vLLM backend
cat > litellm-config.yaml <<'YAML'
model_list:
  - model_name: claude-3-5-sonnet-20241022        # what Claude Code asks for
    litellm_params:
      model: openai/qwen2.5-coder-32b-instruct    # what LiteLLM forwards
      api_base: http://localhost:8000/v1          # your vLLM endpoint
      api_key: sk-local
YAML
# 3. Start the proxy
litellm --config litellm-config.yaml --port 4000
# 4. Point Claude Code at LiteLLM
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="sk-anything"   # any value, or your LiteLLM master key if one is set
# 5. Run
claude

The same shape works for llama.cpp's OpenAI-compatible server (./server -m model.gguf) and LM Studio (built-in OpenAI compat). Adjust api_base and model in the config to match.
Claude Code asks for specific Anthropic model names (claude-3-5-sonnet-20241022, etc.). LiteLLM's model_name field is the alias Claude Code sees; litellm_params.model is what actually gets called. Map every Anthropic model name your Claude Code version requests, or use model_group_alias as a catch-all.
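A sketch of that mapping, extending the config above. The haiku and opus names are examples of aliases a Claude Code release might request, and the model_group_alias block follows LiteLLM's router_settings convention — verify both against your Claude Code version and the LiteLLM docs.

cat > litellm-config.yaml <<'YAML'
model_list:
  - model_name: claude-3-5-sonnet-20241022      # alias Claude Code requests
    litellm_params:
      model: openai/qwen2.5-coder-32b-instruct
      api_base: http://localhost:8000/v1        # vLLM; 8080 for llama.cpp server, 1234 for LM Studio
      api_key: sk-local
  - model_name: claude-3-5-haiku-20241022       # map the small-model alias too
    litellm_params:
      model: openai/qwen2.5-coder-32b-instruct
      api_base: http://localhost:8000/v1
      api_key: sk-local
router_settings:
  model_group_alias:                            # catch-all for other claude-* names
    claude-3-opus-20240229: claude-3-5-sonnet-20241022
YAML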
Which local models fit
The Claude Code prompt + tool-call schema is tuned for Claude-family behavior. 2026 picks that handle it reliably:
Note on the Hermes line in the table below: “specifically tuned for tool-use loops” reflects Nous Research's published function-calling fine-tunes plus observed uptake across community recipes on r/LocalLLaMA and Ollama threads. We don't hold an audited usage count; treat the framing as a popular default rather than a measured winner.
| Model | VRAM | Claude-Code fit |
|---|---|---|
| Qwen 2.5 Coder 32B | 24GB | Best 32B-class coder; clean tool-call schema |
| Hermes 3 8B | 12GB | Specifically tuned for tool-use loops — most robust small-class pick when tool-call reliability matters more than reasoning depth |
| Hermes 4 70B | 48GB+ | Strongest local pick for agentic loops when you have the headroom — tool-use is the differentiator |
| Llama 3.3 70B Instruct | 48GB+ | Solid all-rounder; weaker than dedicated coders on diff patterns |
| DeepSeek Coder V2 16B | 16GB | Best 16GB option; less reliable on multi-step plans |
| Qwen 3 Coder 32B | 24GB | Newer Qwen-line coder; pick whichever Ollama serves cleanest |
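To stage the table's picks on an Ollama backend, pull them ahead of time. The tags below are common Ollama library names, but exact tags shift between releases — verify against the model pages before pulling.

ollama pull qwen2.5-coder:32b        # 24GB-class coder pick
ollama pull hermes3:8b               # small tool-use pick
ollama pull llama3.3:70b             # all-rounder, 48GB+
ollama pull deepseek-coder-v2:16b    # 16GB option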
Models without strong tool-use training (older base models, generic Llamas without Hermes-style finetuning) will hallucinate JSON tool-call structure. Claude Code will then surface confused error states. Stick to the models above.
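If you want to check a candidate model before wiring it into Claude Code, send it one tool definition in Anthropic's tools format and look at what comes back — this assumes your backend (Ollama's compat endpoint here) accepts the tools field:

# Probe tool-call reliability with a single hypothetical read_file tool.
curl -s "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "x-api-key: $ANTHROPIC_AUTH_TOKEN" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "max_tokens": 256,
    "tools": [{
      "name": "read_file",
      "description": "Read a file from the working tree",
      "input_schema": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}
    }],
    "messages": [{"role": "user", "content": "Open README.md"}]
  }'
# Tool-capable models answer with a {"type": "tool_use", ...} content block
# and well-formed "input" JSON; weak models emit the call as plain text.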
Limits + when to keep using cloud
- Frontier reasoning gap. Claude Sonnet 4+ on a complex refactor still outperforms any local 32B coder. If your daily work is cross-codebase architectural moves, local backends will frustrate you.
- 200K+ context. Claude's 200K cloud context is far beyond local VRAM ceilings; around 32K is the realistic local limit, and longer Claude Code prompts will truncate.
- Tool-call drift between releases. Claude Code updates its prompts and tool schemas regularly. A new release can break a working local config until the model catches up (or you update Ollama / LiteLLM mappings) — see the pinning sketch after this list.
- Subscription pricing. Cloud Claude Code on a paid Claude.ai subscription has generous limits at fixed cost. Local trades that for hardware capex + electricity — see /cost-calculator for the math.
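One way to contain that drift is to pin the Claude Code release your local config was validated against and upgrade deliberately. The version below is a placeholder, not a known-good release:

npm view @anthropic-ai/claude-code versions        # see what's published
npm install -g @anthropic-ai/claude-code@1.0.24    # placeholder — substitute your validated version
npm ls -g @anthropic-ai/claude-code                # confirm the pin took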
Keep using cloud when: you need frontier reasoning, you regularly work with 100K+ contexts, your team policy allows cloud, or your monthly Claude Code spend is below what comparable local hardware would cost to amortize.
Natively-local alternatives
Tools built local-first usually involve less friction than a cloud tool routed through a translation layer:
- Aider — terminal-driven, git-aware, OpenAI-compatible from day one. The lowest-friction path on local hardware.
- Cline — VS Code-integrated. Multi-mode personas; project-level rules; great with Hermes/Qwen Coder.
- Roo Code — faster-moving Cline fork. Architect / Code / Ask modes pre-built.
- Continue — VS Code / JetBrains extension; lighter than Cline, good for autocomplete-heavy workflows.
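For comparison, the same Ollama backend behind Aider takes a few lines — a sketch assuming Aider's ollama/ model prefix and OLLAMA_API_BASE convention; check the Aider docs for your version:

export OLLAMA_API_BASE=http://localhost:11434
cd ~/your-repo
aider --model ollama/qwen2.5-coder:32b   # git-aware edits straight against the local backend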
Related guides
- Same pattern, OpenAI's side — routing via OPENAI_API_BASE.
- Same pattern, Google's side — via GOOGLE_GEMINI_BASE_URL.
- Get an Ollama backend running first, then point any CLI at it.
- Every local coding tool we track, and when each fits.