2026-07-01 10:54 UTCIn-site rewrite6 min readUpdated: 2026-07-01 11:30 UTC

Why your AI bill is bigger than it should be

A $287 debugging session prompted Tejas Chopra to create Headroom, an open-source context optimization layer that has saved users $700,000 in five months by compressing and caching context sent to LLMs, treating token waste as a solvable engineering problem.

SourceHacker News AIAuthor: chhum

Article intelligence

EngineersAdvanced

Key points

Most of what we send to LLMs is unnecessary, and we’re paying for it. One $287 AI bill led to a tool that saved users $700,000 in five months.
Token hygiene is the next engineering discipline: treat token budgets like compute credits and measure what a task actually needs, not what it consumes.
Compressing context before it reaches providers gives teams visibility into AI spend that providers have no incentive to offer.
Headroom uses statistical compression, caching, and a retrieval mechanism to reduce token consumption, with different compressors for JSON, code, text, and more.

Why it matters

This matters because most of what we send to LLMs is unnecessary, and we’re paying for it. One $287 AI bill led to a tool that saved users $700,000 in five months.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

This panel is AI-generated and reviewed for accuracy.

Why your AI bill is bigger than it should be

Token hygiene is the next engineering discipline.

July 01, 2026

You have 1 article left to read this month before you need to register a free LeadDev.com account.

Estimated reading time: 13 minutes

Key takeaways:

Most of what you send to LLMs is unnecessary, and you’re paying for all of it. One $287 AI bill led to a tool that saved users $700,000 in five months.

Token hygiene is the next engineering discipline. Treat token budgets like compute credits and measure what a task actually needs, not what it consumes.

Providers compress your data but don’t pass the savings on: compressing context before it reaches them gives teams visibility into AI spend that providers have no incentive to offer.

A $287 debugging session prompted one engineer to rethink how we feed data to large language models (LLMs). The result has saved users an estimated $700,000 in five months.

Tejas Chopra was debugging a Graphics Processing Unit (GPU) failure. Routine procedure for a senior engineer: pull the logs, ask Claude to identify the problem, get on with your day. When the answer came back, he noticed something odd. That single prompt had consumed his entire context window twice over. “I spent a lot of money just asking that one question,” he recalls. He wondered why.

It turned out that the model had read the entire log file multiple times, processing everything before extracting the three lines that actually mattered. By the time Chopra added up his monthly bill, he was looking at $287 for personal project work.

The fix was to rewrite the prompt to ignore INFO lines, and focus only on warnings and alerts. Response time improved and token cost dropped, but Chopra remained perturbed.

“You cannot expect every developer to open their window and curate the prompts to match what they’re looking for,” he says. “People – or models – will blindly say, ‘I need to look at logs, I need to grab the logs.’” To address this, he wondered whether the process could be automated.

The result is Headroom, an open-source context optimization layer for the LLM. On presenting the project at Linux Open Source Summit, Chopra found that the idea really resonated. “Simply put, many companies are struggling firstly to understand where the token spend is, and then optimize for it. Headroom, as an open-source project that can live on your machine, helps with both.”

Before they stopped collecting the stats, Headroom had saved its users an estimated $700,000, and reclaimed 200 billion tokens, in just five months. This early success prompted Chopra to leave his senior engineering job and found Headroom Labs, to explore the idea that most of what we’re sending to LLMs isn’t necessary.

Your inbox, upgraded.

Receive weekly engineering insights to level up your leadership approach.

How Headroom’s compression works

Chopra describes the compression pipeline as having evolved through three distinct stages, each building on the previous one.

The first target was JavaScript Object Notation (JSON), since it is widely used and wasteful when tokenized naively. Whitespace, commas, quotation marks, and nested indentation all cost tokens, without adding semantic meaning. Headroom strips it and converts it to a compact representation that results in “30% savings instantly, without dropping any data,” Chopra says.

Headroom next looks for statistical similarity across values, and compresses accordingly. If 88 out of 90 values in an array fall between 0 and 1, and two are outliers at 99 and 100, you don’t need to transmit all 90 values. You transmit the outliers and a summary: “88 entries between 0 and 1.” The outliers are preserved exactly; the common cases become a single annotation. “That itself is valuable,” Chopra says. “You just need to keep one copy of statistically similar things and the delta.”

Every compressed payload in Headroom is backed by a cache entry with a key, which is a composite of the session ID and a hash of the original data. Since the hash is based on content rather than context, Hash collisions won’t produce cross-session contamination.

The full original payload lives in a local Redis or SQLite instance. Since context that was valid half an hour ago may not be valid now, the cache has a configurable Time To Live (TTL), defaulting to between five and 30 minutes for an individual developer. The expiry forces freshness, without requiring the developer to think about cache invalidation manually.

For enterprise deployments, instead of a local Redis instance, the cache can live in a database such as RDS on AWS, Bigtable on GCP, or Postgres in a private cloud or local data center, depending on which service the organization already uses.

Multiple developers working across multiple sessions can benefit from shared cache entries. A fetched Application Programming Interface (API) response that ten engineers hit on the same afternoon gets compressed and stored once, not ten times. The TTL settings become an organizational decision, configurable centrally.

A risk with compression is that the model may need what you threw away. Chopra’s answer is to leave a tool call in the compressed output. When Headroom compresses a payload, it hashes the original and stores it locally. It then inserts a breadcrumb into the compressed version, which provides a tool definition that the model can call to retrieve the full original data, if it decides it needs it.

If the model is intelligent enough to request more context, the mechanism exists; if it isn’t, nothing is wasted on sending data that the model would have ignored. “I rely on the intelligence of the models to do that,” Chopra says, “plus my own statistical analysis to compress the right stuff out.”

Of course, if the retrieval step fires, it is a full extra tool call with its own latency, although Chopra says that happens in less than 1% of cases. The intent is that it should never be needed: the statistical compression should be conservative enough, and the models sufficiently intelligent, that the compressed version contains everything required to answer the prompt. There’s also a second-order latency effect: passing in fewer tokens means faster processing and a shorter response. On high-throughput workloads, the input compression savings partially offset the first-call overhead.

More like this

How to justify AI investments

Bill Doerrfeld

A different compressor for every context type

Chopra suggests that the numbers work out more favorably than you might expect. This is because while the compression overhead is around 50 milliseconds, Time-To-First-Token (TTFT) from a cold LLM call is typically two to three seconds. Even so, Headroom is currently mid-migration from Python to Rust, specifically to save on latency.

The compress-cache-retrieve approach is used for various other pieces of input context such as code, lock files, web pages, or plain text, but each requires a different compression strategy, so Headroom has a distinct compressor for each.

Code compression uses the abstract syntax tree. Headroom can reason about the code’s structure, and understand which functions are called and which aren’t. Lock files, which can be enormous and almost irrelevant to any given prompt, get their own treatment. Web pages such as documentation, API references, or Stack Overflow answers that may be fetched for context, are processed differently again.

Then there’s unstructured flat text that doesn’t conform to any parseable structure. For this, Chopra trained a small open-source model from scratch. “It looks at every token in the text and either keeps it or drops it,” he notes.

The training signal is based on determining, for each word in a document, whether removing it changes the semantic meaning of the surrounding text. Run that repeatedly across a large corpus and you can train a model that’s essentially learning a compression grammar for natural language. The model, called Kompress Base, is open source and available on Hugging Face. Pass it a financial document today and it will compress it meaningfully, and the model can be further fine-tuned for a given domain.

What Headroom doesn’t compress

Since output tokens are typically priced at five times the input, the cost savings available for output compression are higher than for input. Headroom currently only compresses inputs, but output token compression is in active development, with pull requests (PRs) open at time of writing.

Local file reads, which account for around 60% of the context in typical coding agent flows, are not compressed. Consequently, when an agent reads a source file, it may be looking for a specific line. Compress that file, and you risk dropping that line. The model then falls back to the retrieval tool, adds a round-trip, and the exercise has cost more than it saved.

Instead, Headroom tries to reduce the surface area of what needs to be read in the first place. Tools like Serena or CodeMCP build symbolic indexes and dependency graphs of a codebase. By integrating with them, Headroom can steer an agent toward reading the right five lines in a 100-line file.

Learning from failure

Another interesting feature of Headroom, called ‘learn,’ is a mechanism that mines historical agent sessions for repeated failures and writes corrections back into your CLAUDE.md, or equivalent, files.

Since every developer interacts with AI agents differently, Chopra argues that systems should be curated per individual. Headroom reads the historical session data that coding agents leave behind, uses the model to extract recurring failure patterns, and proposes a correction. “You can only do that by learning from patterns of usage – from their system, via all your historical data.”

The pattern it targets is common. An agent looks for Python at /usr/local/bin/python when the developer’s environment has it at /opt/homebrew/bin/python. It fails. The next session, it tries the same wrong path and fails again. Across ten sessions, a thousand tokens are spent on a mistake that could be fixed with one line in a config file.

New York • September 15 & 16, 2026

Cut through the hype.

Find what works at LDX3 New York

Explore

The integration problem

The compression system in Headroom is technically impressive, but when I asked Chopra what the main challenge with building Headroom is, he cited integration.

Every LLM provider has a different API dialect. Claude’s API differs from OpenAI’s. When you add routing layers (Bedrock, Vertex AI, Azure) those introduce their own variants on top. Furthermore, within a single model family, the API can change between versions in ways that aren’t always clearly documented. Running Claude directly, or via Bedrock or Vertex, each requires a substantially different integration path for what is notionally the same underlying model.

On top of this, the plethora of coding agents and tools makes the compatibility matrix even harder. “You have a multiplicative effect,” Chopra says. “We are now trying to balance that with open source.”

Headroom claims first-class support for Claude and Codex; everything else is marked experimental, with the community filing tickets and contributing fixes as they find edge cases.

Managing your AI bill with token hygiene

There was a brief, deeply embarrassing fashion amongst the silicon valley tech bros for ‘tokenmaxxing,’ the deliberately wasteful practice of maximizing AI token consumption to inflate usage metrics or climb corporate leaderboards, rather than prioritizing business value. When management ties performance to raw token usage, employees game the system.

Tech firms like Meta and Amazon, which has now deprecated its KiroRank leaderboard, introduced internal dashboards tracking token consumption. Employees racked up billions of tokens by running agents or redundant tasks on loop, purely to secure titles like Token Legend, while ignoring both the carbon and fin

[truncated for AI cost control]