2026-07-03 22:39 UTC · In-site rewrite · 6 min read · Updated: 2026-07-03 23:39 UTC

Save Claude Code Tokens with Smart Routing

Brick is a Mixture-of-Models routing gateway that analyzes prompt capability and complexity to route to the most cost-effective model, reducing costs without sacrificing quality. It integrates seamlessly with Claude Code and Codex, offering five modes to balance cost and quality.

Source: Hacker News AI · Author: FrancescoMassa

Article intelligence

Engineers · Advanced

Key points

Brick uses single-forward routing to avoid cascade waste, with six-dimensional capability awareness and complexity assessment.
Seamless integration with Claude Code and Codex via CLI, with five cost/quality modes selectable from the thinking effort slider.
Supports keyword rules for hard overrides or soft biases, and a continuous r knob for the cost-quality trade-off.
Provides observability through a live dashboard showing routing distribution, difficulty mix, and estimated savings.

Why it matters

Cascade routers pay for every failed attempt in tokens and latency. Brick makes one routing decision per query, informed by six capability dimensions and a complexity assessment, so easy prompts stop consuming the most expensive model.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

This panel is AI-generated and reviewed for accuracy.


Brick is a Mixture-of-Models (MoM) routing gateway. It reads each prompt's capability and complexity, then routes it to the best backend in a pool of open- and closed-weight LLMs, matching the strongest single model's quality at a fraction of its cost. No cascades. No wasted calls. Drop-in model: "brick".

When to use Brick · Quickstart · Why Brick · Claude Code · Codex · FAQ · Benchmarks · How it works · Paper

🧩 When can I use Brick?

Brick is for anyone running against more than one model, or paying a flat rate for a single strong one. Three common cases:

You have a pool of models and want each query to reach the right one. Cheap prompts should not burn your most expensive model, and hard prompts should not be starved on a small one. Brick reads capability and complexity per query and dispatches accordingly, so the pool works as one graded system instead of a manual pick.

You want to cut Claude Code / Codex costs without losing quality. Put Brick in front of your coding agent and every request is routed to the cheapest model that can actually do the job, escalating only when the task needs it. You keep the same UX and pay for the hard turns, not the easy ones.

You want to unify different models behind one tool. Use OpenAI models, GLM, DeepSeek, Kimi, Qwen and others from inside Claude Code or Codex through a single OpenAI-compatible endpoint. Define the pool once in config.yaml and call model: "brick" everywhere.

⚡ Quickstart

The fastest working path today is the CLI, which self-hosts the router and wires it into Claude Code for you. Requires Node >= 18 and Docker.

    git clone https://github.com/regolo-ai/brick-SR1.git
    cd brick-SR1/apps/cli && npm install && npm run build && npm link

    brick claude on   # starts the router + wires ANTHROPIC_BASE_URL in ~/.claude/settings.json

Then open a new Claude Code session and pick brick-claude in the /model picker. Every request now routes to haiku / sonnet / opus by capability and complexity. See Brick + Claude Code for modes, the effort picker, and the live brick claude status dashboard.

Prefer a raw OpenAI-compatible gateway (no CLI)?

Once the Docker image is published (see Distribution channels), you'll be able to run the gateway directly:

    docker run --rm -p 18000:18000 \
      -e REGOLO_API_KEY=$REGOLO_API_KEY \
      ghcr.io/regolo-ai/brick:latest   # published at the next v2.1.0 tag

Then call it like any OpenAI endpoint, just set "model": "brick":

    curl http://localhost:18000/v1/chat/completions \
      -H "Authorization: Bearer $REGOLO_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"brick","messages":[{"role":"user","content":"Prove that sqrt(2) is irrational"}]}'

The x-selected-model response header tells you which backend Brick picked. That math prompt routes to a reasoning model; "Hello" routes to the cheapest one.

Until then, brick serve (from the CLI above) runs the same router locally from source.
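The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_request` and `ask` are illustrative, and the response parsing assumes the usual OpenAI chat-completions shape rather than anything Brick-specific.

```python
import json
import urllib.request
from typing import Optional, Tuple

BASE_URL = "http://localhost:18000"  # gateway address used in the curl example


def build_request(prompt: str, api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example, with model set to "brick"."""
    body = json.dumps({
        "model": "brick",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(prompt: str, api_key: str) -> Tuple[str, Optional[str]]:
    """Send the prompt and return (answer text, value of the x-selected-model header)."""
    with urllib.request.urlopen(build_request(prompt, api_key), timeout=120) as resp:
        payload = json.load(resp)
        selected = resp.headers.get("x-selected-model")
    # Assumption: the body follows the standard OpenAI chat-completions layout.
    return payload["choices"][0]["message"]["content"], selected
```

With a router running, `ask("Prove that sqrt(2) is irrational", api_key)` returns the answer together with the backend name Brick reported. Nothing is sent until `ask` is called.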

🤔 Why Brick

| | Single model | RouteLLM | FrugalGPT / Cascade | Brick |
|---|---|---|---|---|
| One call per query (no cascade waste) | ✅ | ✅ | ❌ | ✅ |
| Capability-aware (6 dimensions) | n/a | ❌ binary | ❌ | ✅ |
| Complexity-aware | n/a | partial | ✅ | ✅ |
| Pool of N open + closed models | n/a | 2 | few | ✅ |
| Continuous cost ↔ quality knob | ❌ | ❌ | threshold | ✅ r ∈ [-1, 1] |
| Native multimodal (image / audio) | varies | ❌ | ❌ | ✅ |
| Drop-in OpenAI-compatible | n/a | n/a | n/a | ✅ |

Cascade routers (FrugalGPT, Cascade Routing) call models one after another until a confidence check passes, paying for every miss in tokens and latency. Brick makes a single forward decision per query, so there is nothing to waste.
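To make the cascade argument concrete, here is a small arithmetic sketch. The three-model pool and its per-call costs are hypothetical, and the comparison assumes the single-forward router picks the right model and ignores the (small) cost of the routing classifier itself.

```python
from typing import List, Tuple

# Hypothetical per-call costs in arbitrary units, cheapest model first.
POOL: List[Tuple[str, float]] = [("small", 1.0), ("medium", 4.0), ("large", 10.0)]


def cascade_cost(first_passing_index: int) -> float:
    """A cascade calls every model up to and including the first one that passes."""
    return sum(cost for _, cost in POOL[: first_passing_index + 1])


def single_forward_cost(chosen_index: int) -> float:
    """A single forward decision pays for exactly one call."""
    return POOL[chosen_index][1]


hard = 2  # a hard query that only the large model answers well
print(cascade_cost(hard))         # 15.0
print(single_forward_cost(hard))  # 10.0
```

On easy queries the two strategies cost the same in this toy model; the gap opens on hard queries, where the cascade also pays for each failed attempt.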

🧠 Brick + Claude Code

Put one OpenAI/Anthropic-compatible endpoint in front of Claude Code, and Brick routes every request to haiku, sonnet, or opus based on capability and complexity. You keep the Claude Code UX; Brick picks the cheapest model that can do the job.

Setup

brick claude on # wires ANTHROPIC_BASE_URL in ~/.claude/settings.json, auto-starts the router

Then:

1. Open a new Claude Code session (your current session is unaffected).

2. In the /model picker, select brick-claude (it sits alongside the built-in opus/sonnet/haiku aliases, which it does not replace).

To revert:

brick claude off # restores ANTHROPIC_BASE_URL, optionally stops the router

Use brick claude on --no-start to require an already-healthy router instead of auto-starting one, and brick claude off --stop / --keep to control the router without a prompt.
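If you want to confirm whether the wiring is currently in place, a short script can read the settings file. This is only a sketch: it assumes the variable is stored under a top-level "env" object in ~/.claude/settings.json, which the text above does not specify, and `wired_base_url` is an illustrative helper name.

```python
import json
from pathlib import Path
from typing import Optional

SETTINGS_PATH = Path.home() / ".claude" / "settings.json"


def wired_base_url(settings_path: Path = SETTINGS_PATH) -> Optional[str]:
    """Return ANTHROPIC_BASE_URL from the settings file, or None if it is not set.

    Assumption: the variable lives under a top-level "env" object.
    """
    if not settings_path.exists():
        return None
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    return settings.get("env", {}).get("ANTHROPIC_BASE_URL")
```

Calling `wired_base_url()` after `brick claude on` should then return the local router address, and `None` after `brick claude off`, provided the assumption about the file layout holds.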

The 5 modes: pick your cost/quality trade-off

A mode is how you tell Brick how much to spend. Each one maps easy/medium/hard queries to a model tier, from cheapest (eco, always haiku) to strongest (max, always opus), with lite, mid and pro in between. Pick one and Brick handles the per-query routing inside it.

You switch modes straight from the thinking effort slider in Claude Code's /model picker: low selects eco, medium selects lite, high selects mid, xhigh selects pro, and max selects max. The effort control therefore does not set a thinking budget; it selects the model tier. You can also switch explicitly with brick claude mode, or with brick claude followed by a mode name (eco, lite, mid, pro, or max), mirroring the Codex commands.

mid is the default. On 1M-context requests the map shifts up since Haiku has no 1M variant: easy and medium resolve to sonnet, hard to opus.

Once you have picked the tier, how hard to think is decided autonomously per request from the router's own signals (query difficulty plus the chosen model's headroom).
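The mode mechanics described above can be sketched as two lookups. The effort-to-mode table and the eco and max rows come from the text; the mid row is a hypothetical illustration chosen to agree with the stated 1M-context behaviour, and lite and pro are left out because their maps are not given here. Applying the 1M shift to every mode is also an assumption.

```python
# Effort slider position -> Brick mode, as described above.
EFFORT_TO_MODE = {"low": "eco", "medium": "lite", "high": "mid", "xhigh": "pro", "max": "max"}

# Difficulty -> model tier per mode. Only eco and max are stated in the text;
# the mid row is assumed for illustration.
MODE_TIERS = {
    "eco": {"easy": "haiku", "medium": "haiku", "hard": "haiku"},
    "mid": {"easy": "haiku", "medium": "sonnet", "hard": "opus"},  # assumed
    "max": {"easy": "opus", "medium": "opus", "hard": "opus"},
}


def pick_model(mode: str, difficulty: str, long_context: bool = False) -> str:
    """Resolve a model tier from the mode and the classifier's difficulty verdict."""
    if mode not in MODE_TIERS:
        raise ValueError(f"tier map for mode {mode!r} is not specified in this sketch")
    model = MODE_TIERS[mode][difficulty]
    # Haiku has no 1M variant, so 1M-context requests shift haiku up to sonnet.
    if long_context and model == "haiku":
        model = "sonnet"
    return model


print(EFFORT_TO_MODE["high"])                        # mid
print(pick_model("mid", "easy", long_context=True))  # sonnet
```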

Native models bypass the router

Selecting opus, sonnet, or haiku explicitly in the picker skips Brick entirely: the request is forwarded verbatim to that exact model, with no skill routing and no effort override. Only brick-claude runs the router.

Observability

    brick claude status          # live dashboard (default in an interactive terminal)
    brick claude status --once   # static one-shot view

The dashboard reports, since the last router restart:

Routed by model: count and percent per model.

Per-model effort distribution: how reasoning effort spread out within each model.

Difficulty mix: the classifier's easy/medium/hard verdicts across routed requests.

Economy: an estimated "saved ~X% vs all-opus" figure over the routed request count (a relative estimate derived from the request mix; it excludes real token counts and caching).

It also shows connection/wiring state, classifier latency (avg, p50, p95), and fallback rate.
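One way to read the Economy figure is as a weighted average over the request mix. The sketch below is not the dashboard's actual formula: the relative per-request costs are hypothetical placeholders, and, like the dashboard estimate, it ignores token counts and caching.

```python
from typing import Dict

# Hypothetical relative cost of one request on each tier (opus = 1.0).
RELATIVE_COST: Dict[str, float] = {"haiku": 0.2, "sonnet": 0.6, "opus": 1.0}


def estimated_saving(routed_counts: Dict[str, int]) -> float:
    """Fraction saved versus sending every routed request to opus."""
    total = sum(routed_counts.values())
    if total == 0:
        return 0.0
    actual = sum(RELATIVE_COST[model] * n for model, n in routed_counts.items())
    baseline = RELATIVE_COST["opus"] * total
    return 1.0 - actual / baseline


mix = {"haiku": 50, "sonnet": 30, "opus": 20}
print(f"saved ~{estimated_saving(mix):.0%} vs all-opus")  # saved ~52% vs all-opus
```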

Works with workflows and subagents

Brick routing is per request. In Claude Code workflows and subagents, each agent's call is routed independently as long as that agent uses brick-claude, so a cheap subagent task can land on haiku while a hard one escalates to opus in the same run.

🤖 Use it on Codex

The same idea applies to OpenAI Codex: Brick sits in front of Codex and routes each request across your model pool, so you cut cost on easy turns and can drive Codex with non-OpenAI models through one OpenAI-compatible endpoint.

Setup

brick codex on # sets model/model_provider to brick in ~/.codex/config.toml, auto-starts the router

This materializes a dedicated Codex profile (the OpenAI-pool skill router) and adds a managed provider pointing at the local router. Start a new Codex session and it now routes through Brick.

To revert:

brick codex off # restores your previous Codex model/provider

Codex exposes the same 5 modes and status view as Claude Code:

    brick codex mode     # or: brick codex eco | lite | mid | pro | max
    brick codex status   # live routing dashboard

Use brick codex on --no-start to require an already-healthy router instead of auto-starting one. The Claude and Codex router stacks share host port 8000, so only one can serve at a time; stop the other before wiring.
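Because only one stack can hold the shared port, it can help to check whether something is already listening before wiring the other tool. This is a generic TCP probe, not part of the Brick CLI; `port_in_use` is an illustrative name and the port number is the one stated above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return probe.connect_ex((host, port)) == 0
```

For example, `port_in_use(8000)` returning `True` before `brick codex on` suggests the Claude stack (or another process) still holds the port.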

🔌 Use Brick on its own

You do not need a coding agent. Brick is a plain OpenAI-compatible gateway you can call from any client, script, or app.

    brick serve                  # docker compose up on http://localhost:18000
    brick chat                   # TUI chat against the local router
    brick route "what is 2+2?"   # print the routing decision for a prompt, no call made

Call it like any OpenAI endpoint by setting "model": "brick", exactly as in the Quickstart curl example.

The x-selected-model response header tells you which backend Brick picked: the sqrt(2) proof prompt from the Quickstart routes to a reasoning model, while "Hello" routes to the cheapest one.

Configure the pool in config.yaml

Everything Brick decides comes from config.yaml. The core block is skill_router, where you declare the pool, each model's skill vector, and its cost weight:

    skill_router:
      enabled: true
      capabilities:              # the 6 dimensions every query and model live in
        - coding
        - creative_synthesis
        - instruction_following
        - math_reasoning
        - planning_agentic
        - world_knowledge
      models:
        - model: "qwen3.5-9b"
          skill_vector: [0.71, 0.51, 0.81, 0.91, 0.58, 0.18]   # capability per dimension
          use_reasoning: false
          cost_weight: 0.10      # relative price, drives the cost bias
        - model: "deepseek-v4-flash"
          skill_vector: [0.82, 0.66, 0.86, 0.93, 0.62, 0.49]
          use_reasoning: false
          cost_weight: 0.40
        - model: "kimi2.6"
          skill_vector: [0.90, 0.75, 0.87, 0.94, 0.64, 0.34]
          use_reasoning: true
          reasoning_effort: "medium"
          cost_weight: 0.60

Add or swap any OpenAI-compatible backend here; the backends themselves are declared under provider_profiles / model_config (the shipped config points them all at Regolo). Two more blocks let you nudge routing without touching the math:

    keyword_rules:
      - name: "force_coder"      # hard override: send these prompts to a specific model
        mode: "override"
        model: "kimi2.6"
        operator: "OR"
        keywords: ["debug", "refactor", "compile", "write a function"]
      - name: "coding_bias"      # soft nudge: push one capability dimension up
        mode: "bias"
        capability: "coding"
        operator: "OR"
        keywords: ["python", "rust", "sql", "async"]
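The difference between the two rule modes can be sketched in a few lines. This is an illustration, not the router's implementation: case-insensitive substring matching, stopping at the first override, and the meaning of any operator other than "OR" are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# The two rules from the keyword_rules excerpt above.
RULES: List[Dict] = [
    {"name": "force_coder", "mode": "override", "model": "kimi2.6",
     "operator": "OR", "keywords": ["debug", "refactor", "compile", "write a function"]},
    {"name": "coding_bias", "mode": "bias", "capability": "coding",
     "operator": "OR", "keywords": ["python", "rust", "sql", "async"]},
]


def rule_matches(rule: Dict, prompt: str) -> bool:
    """OR needs any keyword present; any other operator is treated as AND (assumed)."""
    text = prompt.lower()
    hits = [kw.lower() in text for kw in rule["keywords"]]
    return any(hits) if rule["operator"] == "OR" else all(hits)


def apply_rules(prompt: str) -> Tuple[Optional[str], List[str]]:
    """Return (forced model or None, capabilities to bias upward)."""
    biased: List[str] = []
    for rule in RULES:
        if not rule_matches(rule, prompt):
            continue
        if rule["mode"] == "override":
            return rule["model"], biased   # hard override: routing is decided here
        biased.append(rule["capability"])  # soft nudge: scoring still happens later
    return None, biased


print(apply_rules("Please refactor this module"))         # ('kimi2.6', [])
print(apply_rules("Explain async generators in Python"))  # (None, ['coding'])
```

Plain substring matching is deliberately naive: "rust" would also match "trust", so a real matcher would likely need word boundaries.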

Other useful sections: brick (multimodal preprocessing: STT, OCR, vision), the r preference knob with r ∈ [-1, 1] (from max-saving to max-quality), and the classifier endpoints. The CLI can edit most of this for you (brick add model, brick config edit), or you can edit the YAML directly. Full field reference: apps/router/README.md.
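To build intuition for how skill vectors, cost weights, and the r knob might interact, here is a toy scorer. The formula is an assumption made for illustration only (a dot product of the query's capability vector with each skill vector, minus a cost penalty that shrinks as r moves toward 1); this README excerpt does not state Brick's actual scoring rule.

```python
from typing import Dict, List

# Skill vectors and cost weights copied from the skill_router excerpt above.
MODELS: Dict[str, Dict] = {
    "qwen3.5-9b":        {"skill": [0.71, 0.51, 0.81, 0.91, 0.58, 0.18], "cost": 0.10},
    "deepseek-v4-flash": {"skill": [0.82, 0.66, 0.86, 0.93, 0.62, 0.49], "cost": 0.40},
    "kimi2.6":           {"skill": [0.90, 0.75, 0.87, 0.94, 0.64, 0.34], "cost": 0.60},
}


def route(query: List[float], r: float) -> str:
    """Pick the model with the best fit-minus-cost score (illustrative formula only)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    # r = -1 applies the full cost penalty (max saving); r = 1 applies none (max quality).
    cost_penalty = (1.0 - r) / 2.0

    def score(name: str) -> float:
        spec = MODELS[name]
        fit = sum(q * s for q, s in zip(query, spec["skill"]))
        return fit - cost_penalty * spec["cost"]

    return max(MODELS, key=score)


coding_query = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # needs only the coding dimension
print(route(coding_query, r=1.0))    # kimi2.6
print(route(coding_query, r=-1.0))   # qwen3.5-9b
```

Even in this toy version the knob behaves as described: at r = 1 the strongest coder wins regardless of price, and at r = -1 the cheapest adequate model wins.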

🗂️ What's in the repo

A monorepo to run, use, and reproduce every result in the Brick paper.

| Component | Path | Purpose |
|---|---|---|
| Router (Go + Rust) | apps/router/ | OpenAI-format gateway: capability + complexity classifiers, dispatch to the best backend |
| CLI (brick) | apps/cli/ | TypeScript/oclif companion to self-host in one command |
| Training | packages/training/ | ModernBERT capability sweep + complexity LoRA recipes |
| Evaluation | packages/evals/ | Dataset A pipeline + 3-judge majority-vote panel |
| Baselines | packages/evals/baselines/ | Zero-shot RouteLLM, FrugalGPT, Cascade comparisons |
| Paper | docs/paper/ | LaTeX source, figures, compiled PDF |

Full directory tree

    brick-SR1/
    ├── apps/
    │   ├── router/                  # Go + Rust gateway (was vLLM Spatial Router fork)
    │   │   ├── src/spatial-router/  # Go (HTTP proxy, routing pipeline)
    │   │   ├── candle-binding/      # Rust (ML embeddings via candle)
    │   │   ├── ml-binding/          # Rust (L

[truncated for AI cost control]