AI News HubLIVE
站内改写6 min read

Pairing Claude Code with Local Models

Local models in 2026 are good enough. For the tasks Claude Code handles daily: code completion, refactoring, debugging, codebase explanation; a well-chosen quantized model running locally covers the vast majority of real use cases at zero per-token cost and with no rate limits.

SourceKDnuggetsAuthor: Shittu Olumide

--> Pairing Claude Code with Local Models - KDnuggets

-->

Join Newsletter

Introduction

Agentic coding sessions are expensive. A single Claude Code session — reading files, writing code, running tests, iterating — can burn 10–50x more tokens than a plain chat conversation. At scale, that adds up fast. Add rate limits that can interrupt a long-running workflow mid-session, and the dependency on a third-party API that can change pricing, enforce stricter policies, or go down at any point, and the case for local inference becomes straightforward.

Local models in 2026 are good enough. For the tasks Claude Code handles daily — code completion, refactoring, debugging, codebase explanation — a well-chosen quantized model running locally covers the vast majority of real use cases at zero per-token cost and with no rate limits. This article covers three inference backends (Ollama, LM Studio, and llama.cpp), the exact environment variables and configuration files to wire each one to Claude Code, a curated table of models worth running, and the troubleshooting fixes for the issues you will actually hit.

How Claude Code Connects to Any Local Model

The mechanism is simpler than most guides make it look. Claude Code sends requests in the Anthropic Messages API format. By default those requests go to Anthropic's servers. Setting ANTHROPIC_BASE_URL redirects them to any server that speaks the same format, which now includes Ollama, LM Studio, and llama.cpp natively.

According to the official Claude Code environment variables documentation, the variables that matter for this setup are:

ANTHROPIC_BASE_URL: redirects all API calls from Anthropic's servers to whatever URL you set. Set this to your local inference server address.

ANTHROPIC_API_KEY: the API key sent in the request header. Local servers typically ignore authentication, so this is usually set to a placeholder string like "local" or "ollama."

ANTHROPIC_AUTH_TOKEN: an alternative auth header. Some local servers check for this instead of the API key. Set it to the same placeholder.

ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL: Claude Code internally requests different model tiers depending on the task. These three variables map each tier to your local model's name. Without them, Claude Code sends requests for claude-sonnet-4-20250514 to your local server, which will reject the request because no such model exists locally.

In January 2026, Ollama added native support for the Anthropic Messages API, which was the technical change that made this workflow practical without translation proxies. LM Studio added a native /v1/messages endpoint in version 0.4.1. llama.cpp has had direct Anthropic API support for longer. All three now speak Claude Code's native protocol.

A clean architecture diagram showing Claude Code, Ollama, LM Studio, and llama.cpp | Image by Author

Backend 1: Ollama

Ollama is the right starting point. It handles all the complexity of model management — downloading weights, quantization, GPU and CPU allocation, and serving — behind a simple command-line interface (CLI). One command to install, one command to pull a model, a few environment variables to configure. It runs as a background service after install, so there is no manual server start required.

Prerequisites

macOS, Linux, or Windows (WSL2 recommended on Windows)

At least 16 GB RAM for practical use (32 GB recommended)

GPU with 8+ GB VRAM for GPU inference, or CPU-only with enough RAM

Ollama v0.14.0 or later required for Anthropic Messages API support

Install Ollama:

macOS and Linux -- one command install

curl -fsSL https://ollama.com/install.sh | sh

Verify the version -- must be 0.14.0+ for Claude Code compatibility

ollama version

Expected: ollama version is 0.14.x or higher

Windows: download the installer from https://ollama.com

Native Windows support has improved significantly in recent releases

After installation, Ollama starts automatically as a background service on port 11434. You can verify it is running:

Check the Ollama server is live

curl http://localhost:11434

Expected response:

Ollama is running

Pull a coding model:

GLM-4.7-Flash -- recommended starting point

Strong tool calling, 128K context, fits on 8 GB VRAM

Apache 2.0 license

ollama pull glm-4.7-flash:latest

Qwen3-Coder -- strong code generation and instruction following

Requires 20+ GB VRAM for the full model

ollama pull qwen3-coder

Devstral-Small -- specifically designed for agentic coding workflows

Community-tested for Claude Code compatibility

24B, requires 16+ GB VRAM

ollama pull devstral-small-2:24b

Verify the model is downloaded and ready

ollama list

Shows all pulled models with their sizes and modification dates

// Configuring Claude Code to Use Ollama

Option 1: Shell export (current terminal session only)

Redirect Claude Code to your local Ollama server

export ANTHROPIC_BASE_URL="http://localhost:11434"

Local servers do not require real authentication

Set these to any non-empty string -- Ollama ignores the value

export ANTHROPIC_API_KEY="ollama" export ANTHROPIC_AUTH_TOKEN="ollama"

Map Claude Code's model tier requests to your local model name

Claude Code internally requests sonnet/haiku/opus -- these variables

translate those tier names to whatever model you have pulled locally

export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7-flash:latest" export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7-flash:latest" export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-4.7-flash:latest"

Launch Claude Code -- it will now use Ollama instead of the Anthropic API

claude

Option 2: ~/.claude/settings.json (permanent, applies to all sessions)

This approach survives terminal restarts and applies every time you launch Claude Code. Claude Code reads environment variables from settings.json at startup so they take effect no matter how claude was launched.

Create or edit ~/.claude/settings.json:

{ "env": { "ANTHROPIC_BASE_URL": "http://localhost:11434", "ANTHROPIC_API_KEY": "ollama", "ANTHROPIC_AUTH_TOKEN": "ollama", "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7-flash:latest", "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-4.7-flash:latest", "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-4.7-flash:latest" } }

Option 3: .env file in project directory (per-project override)

If you want a specific project to use a different model while keeping your global settings on the Anthropic API:

.env in your project root -- loaded automatically by Claude Code

ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY=ollama ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_DEFAULT_SONNET_MODEL=qwen3-coder ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3-coder ANTHROPIC_DEFAULT_OPUS_MODEL=qwen3-coder

Verify the connection:

Launch Claude Code with a simple test

claude

Inside Claude Code, run a basic prompt:

> What model are you running?

A local model should respond without making any Anthropic API calls.

To confirm no external calls are being made, run with verbose logging:

claude --verbose

Look for lines showing requests going to localhost:11434

rather than api.anthropic.com

Full working sequence from scratch:

curl -fsSL https://ollama.com/install.sh | sh # 1. Install Ollama ollama pull glm-4.7-flash:latest # 2. Pull model (~4 GB) export ANTHROPIC_BASE_URL="http://localhost:11434" # 3. Redirect Claude Code export ANTHROPIC_API_KEY="ollama" # 4. Set placeholder auth export ANTHROPIC_AUTH_TOKEN="ollama" export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7-flash:latest" export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7-flash:latest" export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-4.7-flash:latest" claude # 5. Launch

Backend 2: LM Studio

LM Studio is the right choice if you want a graphical interface for browsing and managing models rather than working entirely in the terminal. Since version 0.4.1, it includes a native Anthropic-compatible /v1/messages endpoint — the same path Claude Code expects — so no translation layer or proxy is needed.

Prerequisites:

macOS, Windows, or Linux

GPU with 6+ GB VRAM recommended (CPU-only is possible but slow)

Download from lmstudio.ai or use the CLI installer for headless servers

Install and configure LM Studio:

On a server or VM without a GUI -- CLI installer

curl -fsSL https://releases.lmstudio.ai/cli/install.sh | bash

Or download the desktop app from https://lmstudio.ai for GUI use

GUI setup steps:

Open LM Studio and search for a coding model (search "qwen coder" or "devstral").

Download the model. LM Studio handles quantization selection automatically.

Go to the Local Server tab (the icon in the left sidebar).

Set the context size. LM Studio recommends starting with at least 25,000 tokens and increasing for better results.

Click Start Server.

Note the port (default: 1234) and copy the model name exactly as shown.

Note: Copy the model identifier exactly. LM Studio displays the exact string you need to pass to ANTHROPIC_DEFAULT_SONNET_MODEL. A mismatch here is the most common failure mode.

Configure Claude Code:

Set the base URL to LM Studio's local server

export ANTHROPIC_BASE_URL="http://localhost:1234" export ANTHROPIC_API_KEY="lm-studio" export ANTHROPIC_AUTH_TOKEN="lm-studio"

Replace the model name with what LM Studio shows for your loaded model

Copy it exactly -- including any version suffix or quantization tag

export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen2.5-coder-32b-instruct" export ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen2.5-coder-32b-instruct" export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen2.5-coder-32b-instruct"

Or persistently in ~/.claude/settings.json:

{ "env": { "ANTHROPIC_BASE_URL": "http://localhost:1234", "ANTHROPIC_API_KEY": "lm-studio", "ANTHROPIC_AUTH_TOKEN": "lm-studio", "ANTHROPIC_DEFAULT_SONNET_MODEL": "qwen2.5-coder-32b-instruct", "ANTHROPIC_DEFAULT_HAIKU_MODEL": "qwen2.5-coder-32b-instruct", "ANTHROPIC_DEFAULT_OPUS_MODEL": "qwen2.5-coder-32b-instruct" } }

How to run:

1. Start the LM Studio server from the GUI (Local Server tab > Start Server)

2. Set environment variables

export ANTHROPIC_BASE_URL="http://localhost:1234" export ANTHROPIC_API_KEY="lm-studio" export ANTHROPIC_AUTH_TOKEN="lm-studio" export ANTHROPIC_DEFAULT_SONNET_MODEL="your-model-name-here" export ANTHROPIC_DEFAULT_HAIKU_MODEL="your-model-name-here" export ANTHROPIC_DEFAULT_OPUS_MODEL="your-model-name-here"

3. Launch

claude

Backend 3: llama.cpp

llama.cpp is the right choice when you need direct control over inference parameters — quantization type, KV cache configuration, batch size, thread count — or when you are running on a server and want the lowest overhead. It has native Anthropic Messages API support, so no proxy or translation layer is needed.

Prerequisites:

A GGUF-format model file (download from Hugging Face; search for "GGUF" versions of any model)

CUDA-capable GPU for GPU inference, or CPU-only for slower inference

CMake and a C++ compiler for source builds (on Linux/CUDA, source is recommended)

Install llama.cpp:

macOS -- Homebrew is simplest

brew install llama.cpp

Linux with CUDA -- build from source for best GPU performance

git clone https://github.com/ggml-org/llama.cpp cd llama.cpp cmake -B build -DGGML_CUDA=ON # Enable CUDA acceleration cmake --build build --config Release # Build

Binaries in ./build/bin/

Linux CPU-only build

cmake -B build cmake --build build --config Release

Windows -- pre-built binaries available at:

https://github.com/ggml-org/llama.cpp/releases

Download the CUDA or CPU variant matching your hardware

Download a GGUF model:

Install the Hugging Face CLI if you do not have it

pip install huggingface-hub

Download GLM-4.7-Flash in Q4_K_XL quantization (~4.5 GB)

This quantization offers

[truncated for AI cost control]