Headroom – The context compression layer for AI agents
Headroom is an open-source tool that compresses everything AI agents read—tool outputs, logs, RAG chunks, files, and conversation history—before it reaches the LLM, reducing tokens by 60-95% while preserving answer accuracy. It offers library, proxy, agent wrap, and MCP server modes, with reversible compression and cross-agent memory.
Uh oh!
There was an error while loading. Please reload this page.
Notifications You must be signed in to change notification settings
Fork 3.2k
Star 45.5k
BranchesTags
Open more actions menu
Folders and files
NameName
Last commit message
Last commit date
Latest commit
History
1,664 Commits
1,664 Commits
.claude-plugin
.claude-plugin
.devcontainer
.devcontainer
.github
.github
REALIGNMENT
REALIGNMENT
benchmarks
benchmarks
crates
crates
docker
docker
docs
docs
e2e
e2e
examples
examples
headroom
headroom
plugins
plugins
scripts
scripts
sdk/typescript
sdk/typescript
sql
sql
tests
tests
wiki
wiki
.actrc
.actrc
.actrc.local.example
.actrc.local.example
.changelog.md
.changelog.md
.commitlintrc.json
.commitlintrc.json
.dockerignore
.dockerignore
.env.act.example
.env.act.example
.env.example
.env.example
.git-blame-ignore-revs
.git-blame-ignore-revs
.gitattributes
.gitattributes
.gitguardian.yaml
.gitguardian.yaml
.gitignore
.gitignore
.pre-commit-config.yaml
.pre-commit-config.yaml
.release-please-config.json
.release-please-config.json
.release-please-manifest.json
.release-please-manifest.json
CHANGELOG.md
CHANGELOG.md
CODE_OF_CONDUCT.md
CODE_OF_CONDUCT.md
CONTRIBUTING.md
CONTRIBUTING.md
Cargo.lock
Cargo.lock
Cargo.toml
Cargo.toml
Dockerfile
Dockerfile
ENTERPRISE.md
ENTERPRISE.md
Headroom-2.gif
Headroom-2.gif
HeadroomDemo-Fast.gif
HeadroomDemo-Fast.gif
LICENSE
LICENSE
Makefile
Makefile
NOTICE
NOTICE
PR.md
PR.md
README.md
README.md
RUST_DEV.md
RUST_DEV.md
SECURITY.md
SECURITY.md
TESTING-copilot-subscription.md
TESTING-copilot-subscription.md
claude_analysis_ttl.py
claude_analysis_ttl.py
codecov.yml
codecov.yml
deny.toml
deny.toml
docker-bake.hcl
docker-bake.hcl
docker-compose.yml
docker-compose.yml
headroom-savings.png
headroom-savings.png
headroom_learn.gif
headroom_learn.gif
llms.txt
llms.txt
mkdocs.yml
mkdocs.yml
pyproject.toml
pyproject.toml
rust-toolchain.toml
rust-toolchain.toml
uv.lock
uv.lock
Repository files navigation
60–95% fewer tokens · library · proxy · MCP · 6 algorithms · local-first · reversible
Docs · Install · Proof · Agents · Discord · llms.txt · Enterprise
AI agents / LLMs: read /llms.txt here, or fetch the live index / full docs blob.
Headroom compresses everything your AI agent reads — tool outputs, logs, RAG chunks, files, and conversation history — before it reaches the LLM. Same answers, fraction of the tokens.
Live: 10,144 → 1,260 tokens — same FATAL found.
What it does
Library — compress(messages) in Python or TypeScript, inline in any app
Proxy — headroom proxy --port 8787, zero code changes, any language
Agent wrap — headroom wrap claude|codex|cursor|aider|copilot in one command
MCP server — headroom_compress, headroom_retrieve, headroom_stats for any MCP client
Cross-agent memory — shared store across Claude, Codex, Gemini, auto-dedup
headroom learn — mines failed sessions, writes corrections to CLAUDE.md / AGENTS.md
Output token reduction — trims what the model writes back (not just what you send): drops ceremony/restated code and skips deep "thinking" on routine steps. See Output token reduction.
Reversible (CCR) — originals are cached for retrieval on demand
How it works (30 seconds)
Your agent / app (Claude Code, Cursor, Codex, LangChain, Agno, Strands, your own code…) │ prompts · tool outputs · logs · RAG results · files ▼ ┌────────────────────────────────────────────────────┐ │ Headroom (runs locally — your data stays here) │ │ ──────────────────────────────────────────────── │ │ CacheAligner → ContentRouter → CCR │ │ ├─ SmartCrusher (JSON) │ │ ├─ CodeCompressor (AST) │ │ └─ Kompress-base (text, HF) │ │ │ │ Cross-agent memory · headroom learn · MCP │ └────────────────────────────────────────────────────┘ │ compressed prompt + retrieval tool ▼ LLM provider (Anthropic · OpenAI · Bedrock · …)
ContentRouter — detects content type, selects the right compressor
SmartCrusher / CodeCompressor / Kompress-base — compress JSON, AST, or prose
CacheAligner — stabilizes prefixes so provider KV caches actually hit
CCR — stores originals locally; LLM calls headroom_retrieve if it needs them
→ Architecture · CCR reversible compression · Kompress-v2-base model card
Get started (60 seconds)
1 — Install
pip install "headroom-ai[all]" # Python npm install headroom-ai # Node / TypeScript
2 — Pick your mode
headroom wrap claude # wrap a coding agent headroom proxy --port 8787 # drop-in proxy, zero code changes
or: from headroom import compress # inline library
3 — See the savings
headroom perf
Granular extras: [proxy], [mcp], [ml], [code], [memory], [relevance], [image], [agno], [langchain], [evals], [pytorch-mps] (Apple-GPU memory-embedder offload — set HEADROOM_EMBEDDER_RUNTIME=pytorch_mps). Requires Python 3.10+.
Proof
Savings on real agent workloads:
Workload Before After Savings
Code search (100 results) 17,765 1,408 92%
SRE incident debugging 65,694 5,118 92%
GitHub issue triage 54,174 14,761 73%
Codebase exploration 78,502 41,254 47%
Accuracy preserved on standard benchmarks:
Benchmark Category N Baseline Headroom Delta
GSM8K Math 100 0.870 0.870 ±0.000
TruthfulQA Factual 100 0.530 0.560 +0.030
SQuAD v2 QA 100 — 97% 19% compression
BFCL Tools 100 — 97% 32% compression
Reproduce: python -m headroom.evals suite --tier 1 · Full benchmarks & methodology
Output token reduction (cut what the model writes back)
Everything above shrinks the prompt you send. But you also pay for every token the model writes back — and on Opus-class models output costs 5× input. A lot of that output is waste: "Great, let me…" preambles, re-printing code you just showed it, and deep "thinking" on routine steps like reading a file.
Headroom can trim that too, from the proxy, without you changing any code:
Verbosity steering — appends a short "be terse, don't restate context" note to the end of the system prompt (so your prompt cache still hits).
Effort routing — when a turn is just the model resuming after a tool result (a file read, a passing test), it dials the model's thinking effort down. New questions and errors keep full effort.
Turn it on:
export HEADROOM_OUTPUT_SHAPER=1 # off by default headroom proxy --port 8787
Already running a proxy? These switches are read live on every request, so a proxy that headroom wrap reused (rather than started) would not see a value you export afterwards — its environment was snapshotted at launch. headroom wrap now hot-syncs your current settings to the running proxy via a loopback POST /admin/runtime-env, so they take effect immediately with no restart (no cold start, no dropped requests, no lost caches). Set them before you wrap. On a shared proxy these overrides are global — the last explicit setting wins.
Learn the right terseness for you. People don't say how terse they want answers — they show it (they interrupt long replies, or move on before they could have read them). headroom learn --verbosity reads your past sessions and picks the level automatically:
headroom learn --verbosity # preview what it found (dry run) headroom learn --verbosity --apply # save it; the proxy uses it from now on
See how many output tokens you saved. Output savings are counterfactual — we never see what the model would have written — so Headroom reports an honest estimate with a confidence range, never a made-up number:
headroom output-savings
Reduction: 31.7% (95% CI 27.7% … 35.7%) [estimated]
Want a measured number instead of an estimate? Leave 10% of conversations unshaped as a control group: export HEADROOM_OUTPUT_HOLDOUT=0.1. The dashboard shows an Output Tokens Saved card next to input compression, labelled measured or estimated with the confidence band.
→ Full write-up incl. the measurement methodology: docs/proposals/output-token-reduction.md
Agent compatibility matrix
Agent headroom wrap Notes
Claude Code ✅ --memory · --code-graph
Codex ✅ shares memory with Claude
Cursor ✅ prints config — paste once
Aider ✅ starts proxy + launches
Copilot CLI ✅ starts proxy + launches
OpenClaw ✅ installs as ContextEngine plugin
Cortex Code ✅ 60–65% savings · library mode
Any OpenAI-compatible client works via headroom proxy. MCP-native: headroom mcp install.
GitHub Copilot CLI subscription mode
Headroom can route GitHub Copilot CLI subscription traffic through the local proxy:
headroom copilot-auth login headroom wrap copilot --subscription -- --model gpt-4o
This lets Headroom intercept OpenAI-compatible Copilot CLI requests and apply the same proxy compression pipeline before forwarding to GitHub Copilot's hosted API. The wrapper exchanges Headroom's reusable GitHub OAuth token for Copilot's short-lived API token and prints the upstream endpoint as COPILOT_PROVIDER_API_URL=... during launch.
headroom copilot-auth login stores a Headroom-specific Copilot OAuth token. This avoids relying on generic GitHub or Copilot CLI tokens that can read Copilot account metadata but may still be rejected by Copilot's token-exchange endpoint.
For GitHub Enterprise Server or custom-domain Copilot deployments, set the deployment domain before launching:
export GITHUB_COPILOT_ENTERPRISE_DOMAIN=ghe.example.com
For GitHub.com Enterprise Cloud URLs such as github.com/enterprises/your-enterprise, do not set an enterprise-domain override. Headroom uses GitHub's normal token-exchange endpoint and the Copilot API endpoint advertised for the signed-in account.
Platform support note: macOS auth reuse via Copilot CLI Keychain storage has been smoke-tested. Windows Credential Manager, Linux Secret Service / secret-tool, and Docker/CI token-injection paths are implemented or planned as auth-discovery paths, but still need real OS validation before they should be considered fully vetted. For Docker and CI, prefer passing an explicit GITHUB_COPILOT_TOKEN or GITHUB_COPILOT_GITHUB_TOKEN rather than relying on host keychain access.
When to use · When to skip
Great fit if you…
run AI coding agents daily and want savings without changing your code
work across multiple agents and want shared memory
need reversible compression — originals are retrievable via CCR within the configured TTL
Skip it if you…
only use a single provider's native compaction and don't need cross-agent memory
work in a sandboxed environment where local processes can't run
Integrations — drop Headroom into any stack
Your setup Hook in with
Any Python app compress(messages, model=…)
Any TypeScript app await compress(messages, { model })
Anthropic / OpenAI SDK withHeadroom(new Anthropic()) · withHeadroom(new OpenAI())
Vercel AI SDK wrapLanguageModel({ model, middleware: headroomMiddleware() })
LiteLLM litellm.callbacks = [HeadroomCallback()]
LangChain HeadroomChatModel(your_llm)
Agno HeadroomAgnoModel(your_model)
Strands Strands guide
ASGI apps app.add_middleware(CompressionMiddleware)
Multi-agent SharedContext().put / .get
MCP clients headroom mcp install
What's inside
SmartCrusher — universal JSON: arrays of dicts, nested objects, mixed types.
CodeCompressor — AST-aware for Python, JS, Go, Rust, Java, C++.
Kompress-base — our HuggingFace model, trained on agentic traces.
Image compression — 40–90% reduction via trained ML router.
CacheAligner — stabilizes prefixes so Anthropic/OpenAI KV caches actually hit.
IntelligentContext — score-based context fitting with learned importance.
CCR — reversible compression; LLM retrieves originals on demand.
Cross-agent memory — shared store, agent provenance, auto-de
[truncated for AI cost control]