2026-06-23 17:56 UTCIn-site rewrite2 min readUpdated: 2026-06-23 18:13 UTC

99% compressed, 1% on the bill: I audited 1B tokens to find out why

The author found that token costs paradoxically increased as models became cheaper, due to higher consumption. After investigating existing compression methods, they found them ineffective for agent-specific data like logs and SQL. They developed a custom architecture achieving 99.9% compression margin, but real savings depend on usage patterns.

SourceHacker News AIAuthor: josuramos

Late last year I went through this firsthand: massive token burn on my team. And a paradox you might recognize. The cheaper tokens got, the higher the bill came in. Better, faster models invite heavier use, and consumption grows faster than prices fall. Budgets kept rising instead of leveling off.

I tried everything to cut costs. Eventually I looked into data-compression companies and GitHub repos, and here I had a home-field advantage: I come from building bots for prediction markets, and squeezing data is what I've done for years. To my surprise, it was disappointing. What I found compressed text as text, in the academic state-of-the-art style (LLMLingua and its successors): it drops the tokens with the lowest statistical weight. Works for prose. Fails on exactly what an agent eats all day, like logs, SQL schemas, diffs, stack traces, test output, and API responses. Generic compression fails the same way.

To be clear about where the cost comes from: an LLM generates spend on four sides of the bill, which are input, cache write, cache read, and output. Data compression never touches the output, one of the biggest bottlenecks. And in every compressor I looked at, I ran into the same thing. They sold a high compression ratio as if it were savings, with no clear study behind it. It isn't the same thing, and that gap was exactly what I wanted to measure.

Tired of all this, I started working on an architecture of my own. The idea was that the real savings trigger wasn't compressing more, but mapping and tracking correctly what passed through, with savings ceilings that depend on how each person uses it. My compressions reach 99.9% compression margin. But how much of that actually comes off my bill? That was the question that pulled the rest along.

The intuition behind it is simple. Good mapping gives the AI only what it needs to keep working, meaning the fewest tokens in the end. And it isn't only money. It gets faster and more accurate, because the model's attention is finite and a clean context goes further.

But compressing input wasn't enough. I had to look at the output too.