2026-06-22 07:50 UTCIn-site rewrite5 min readUpdated: 2026-06-22 08:02 UTC

Headroom – The context compression layer for AI agents

Headroom is an open-source tool that compresses everything AI agents read—tool outputs, logs, RAG chunks, files, and conversation history—before it reaches the LLM, reducing tokens by 60-95% while preserving answer accuracy. It offers library, proxy, agent wrap, and MCP server modes, with reversible compression and cross-agent memory.

SourceHacker News AIAuthor: sibellavia

Uh oh!

There was an error while loading. Please reload this page.

Notifications You must be signed in to change notification settings

Fork 3.2k

Star 45.5k

BranchesTags

Open more actions menu

Folders and files

NameName

Last commit message

Last commit date

Latest commit

History

1,664 Commits

.claude-plugin

.devcontainer

.github

REALIGNMENT

benchmarks

crates

docker

docs

e2e

examples

headroom

plugins

scripts

sdk/typescript

sql

tests

wiki

.actrc

.actrc.local.example

.changelog.md

.commitlintrc.json

.dockerignore

.env.act.example

.env.example

.git-blame-ignore-revs

.gitattributes

.gitguardian.yaml

.gitignore

.pre-commit-config.yaml

.release-please-config.json

.release-please-manifest.json

CHANGELOG.md

CODE_OF_CONDUCT.md

CONTRIBUTING.md

Cargo.lock

Cargo.toml

Dockerfile

ENTERPRISE.md

Headroom-2.gif

HeadroomDemo-Fast.gif

LICENSE

Makefile

NOTICE

PR.md

README.md

RUST_DEV.md

SECURITY.md

TESTING-copilot-subscription.md

claude_analysis_ttl.py

codecov.yml

deny.toml

docker-bake.hcl

docker-compose.yml

headroom-savings.png

headroom_learn.gif

llms.txt

mkdocs.yml

pyproject.toml

rust-toolchain.toml

uv.lock

Repository files navigation

60–95% fewer tokens · library · proxy · MCP · 6 algorithms · local-first · reversible

Docs · Install · Proof · Agents · Discord · llms.txt · Enterprise

AI agents / LLMs: read /llms.txt here, or fetch the live index / full docs blob.

Headroom compresses everything your AI agent reads — tool outputs, logs, RAG chunks, files, and conversation history — before it reaches the LLM. Same answers, fraction of the tokens.

Live: 10,144 → 1,260 tokens — same FATAL found.

What it does

Library — compress(messages) in Python or TypeScript, inline in any app

Proxy — headroom proxy --port 8787, zero code changes, any language

Agent wrap — headroom wrap claude|codex|cursor|aider|copilot in one command

MCP server — headroom_compress, headroom_retrieve, headroom_stats for any MCP client

Cross-agent memory — shared store across Claude, Codex, Gemini, auto-dedup

headroom learn — mines failed sessions, writes corrections to CLAUDE.md / AGENTS.md

Output token reduction — trims what the model writes back (not just what you send): drops ceremony/restated code and skips deep "thinking" on routine steps. See Output token reduction.

Reversible (CCR) — originals are cached for retrieval on demand

How it works (30 seconds)

Your agent / app (Claude Code, Cursor, Codex, LangChain, Agno, Strands, your own code…) │ prompts · tool outputs · logs · RAG results · files ▼ ┌────────────────────────────────────────────────────┐ │ Headroom (runs locally — your data stays here) │ │ ──────────────────────────────────────────────── │ │ CacheAligner → ContentRouter → CCR │ │ ├─ SmartCrusher (JSON) │ │ ├─ CodeCompressor (AST) │ │ └─ Kompress-base (text, HF) │ │ │ │ Cross-agent memory · headroom learn · MCP │ └────────────────────────────────────────────────────┘ │ compressed prompt + retrieval tool ▼ LLM provider (Anthropic · OpenAI · Bedrock · …)

ContentRouter — detects content type, selects the right compressor

SmartCrusher / CodeCompressor / Kompress-base — compress JSON, AST, or prose

CacheAligner — stabilizes prefixes so provider KV caches actually hit

CCR — stores originals locally; LLM calls headroom_retrieve if it needs them

→ Architecture · CCR reversible compression · Kompress-v2-base model card

Get started (60 seconds)

1 — Install

pip install "headroom-ai[all]" # Python npm install headroom-ai # Node / TypeScript

2 — Pick your mode

headroom wrap claude # wrap a coding agent headroom proxy --port 8787 # drop-in proxy, zero code changes

or: from headroom import compress # inline library

3 — See the savings

headroom perf

Granular extras: [proxy], [mcp], [ml], [code], [memory], [relevance], [image], [agno], [langchain], [evals], [pytorch-mps] (Apple-GPU memory-embedder offload — set HEADROOM_EMBEDDER_RUNTIME=pytorch_mps). Requires Python 3.10+.

Proof

Savings on real agent workloads:

Workload Before After Savings

Code search (100 results) 17,765 1,408 92%

SRE incident debugging 65,694 5,118 92%

GitHub issue triage 54,174 14,761 73%

Codebase exploration 78,502 41,254 47%

Accuracy preserved on standard benchmarks:

Benchmark Category N Baseline Headroom Delta

GSM8K Math 100 0.870 0.870 ±0.000

TruthfulQA Factual 100 0.530 0.560 +0.030

SQuAD v2 QA 100 — 97% 19% compression

BFCL Tools 100 — 97% 32% compression

Reproduce: python -m headroom.evals suite --tier 1 · Full benchmarks & methodology

Output token reduction (cut what the model writes back)

Everything above shrinks the prompt you send. But you also pay for every token the model writes back — and on Opus-class models output costs 5× input. A lot of that output is waste: "Great, let me…" preambles, re-printing code you just showed it, and deep "thinking" on routine steps like reading a file.

Headroom can trim that too, from the proxy, without you changing any code:

Verbosity steering — appends a short "be terse, don't restate context" note to the end of the system prompt (so your prompt cache still hits).

Effort routing — when a turn is just the model resuming after a tool result (a file read, a passing test), it dials the model's thinking effort down. New questions and errors keep full effort.

Turn it on:

export HEADROOM_OUTPUT_SHAPER=1 # off by default headroom proxy --port 8787

Already running a proxy? These switches are read live on every request, so a proxy that headroom wrap reused (rather than started) would not see a value you export afterwards — its environment was snapshotted at launch. headroom wrap now hot-syncs your current settings to the running proxy via a loopback POST /admin/runtime-env, so they take effect immediately with no restart (no cold start, no dropped requests, no lost caches). Set them before you wrap. On a shared proxy these overrides are global — the last explicit setting wins.

Learn the right terseness for you. People don't say how terse they want answers — they show it (they interrupt long replies, or move on before they could have read them). headroom learn --verbosity reads your past sessions and picks the level automatically:

headroom learn --verbosity # preview what it found (dry run) headroom learn --verbosity --apply # save it; the proxy uses it from now on

See how many output tokens you saved. Output savings are counterfactual — we never see what the model would have written — so Headroom reports an honest estimate with a confidence range, never a made-up number:

headroom output-savings

Reduction: 31.7% (95% CI 27.7% … 35.7%) [estimated]

Want a measured number instead of an estimate? Leave 10% of conversations unshaped as a control group: export HEADROOM_OUTPUT_HOLDOUT=0.1. The dashboard shows an Output Tokens Saved card next to input compression, labelled measured or estimated with the confidence band.

→ Full write-up incl. the measurement methodology: docs/proposals/output-token-reduction.md

Agent compatibility matrix

Agent headroom wrap Notes

Claude Code ✅ --memory · --code-graph

Codex ✅ shares memory with Claude

Cursor ✅ prints config — paste once

Aider ✅ starts proxy + launches

Copilot CLI ✅ starts proxy + launches

OpenClaw ✅ installs as ContextEngine plugin

Cortex Code ✅ 60–65% savings · library mode

Any OpenAI-compatible client works via headroom proxy. MCP-native: headroom mcp install.

GitHub Copilot CLI subscription mode

Headroom can route GitHub Copilot CLI subscription traffic through the local proxy:

headroom copilot-auth login headroom wrap copilot --subscription -- --model gpt-4o

This lets Headroom intercept OpenAI-compatible Copilot CLI requests and apply the same proxy compression pipeline before forwarding to GitHub Copilot's hosted API. The wrapper exchanges Headroom's reusable GitHub OAuth token for Copilot's short-lived API token and prints the upstream endpoint as COPILOT_PROVIDER_API_URL=... during launch.

headroom copilot-auth login stores a Headroom-specific Copilot OAuth token. This avoids relying on generic GitHub or Copilot CLI tokens that can read Copilot account metadata but may still be rejected by Copilot's token-exchange endpoint.

For GitHub Enterprise Server or custom-domain Copilot deployments, set the deployment domain before launching:

export GITHUB_COPILOT_ENTERPRISE_DOMAIN=ghe.example.com

For GitHub.com Enterprise Cloud URLs such as github.com/enterprises/your-enterprise, do not set an enterprise-domain override. Headroom uses GitHub's normal token-exchange endpoint and the Copilot API endpoint advertised for the signed-in account.

Platform support note: macOS auth reuse via Copilot CLI Keychain storage has been smoke-tested. Windows Credential Manager, Linux Secret Service / secret-tool, and Docker/CI token-injection paths are implemented or planned as auth-discovery paths, but still need real OS validation before they should be considered fully vetted. For Docker and CI, prefer passing an explicit GITHUB_COPILOT_TOKEN or GITHUB_COPILOT_GITHUB_TOKEN rather than relying on host keychain access.

When to use · When to skip

Great fit if you…

run AI coding agents daily and want savings without changing your code

work across multiple agents and want shared memory

need reversible compression — originals are retrievable via CCR within the configured TTL

Skip it if you…

only use a single provider's native compaction and don't need cross-agent memory

work in a sandboxed environment where local processes can't run

Integrations — drop Headroom into any stack

Your setup Hook in with

Any Python app compress(messages, model=…)

Any TypeScript app await compress(messages, { model })

Anthropic / OpenAI SDK withHeadroom(new Anthropic()) · withHeadroom(new OpenAI())

Vercel AI SDK wrapLanguageModel({ model, middleware: headroomMiddleware() })

LiteLLM litellm.callbacks = [HeadroomCallback()]

LangChain HeadroomChatModel(your_llm)

Agno HeadroomAgnoModel(your_model)

Strands Strands guide

ASGI apps app.add_middleware(CompressionMiddleware)

Multi-agent SharedContext().put / .get

MCP clients headroom mcp install

What's inside

SmartCrusher — universal JSON: arrays of dicts, nested objects, mixed types.

CodeCompressor — AST-aware for Python, JS, Go, Rust, Java, C++.

Kompress-base — our HuggingFace model, trained on agentic traces.

Image compression — 40–90% reduction via trained ML router.

CacheAligner — stabilizes prefixes so Anthropic/OpenAI KV caches actually hit.

IntelligentContext — score-based context fitting with learned importance.

CCR — reversible compression; LLM retrieves originals on demand.

Cross-agent memory — shared store, agent provenance, auto-de

[truncated for AI cost control]