2026-06-30 16:06 UTC · Updated: 2026-06-30 16:26 UTC

The End of Tokenmaxxing

Tokenmaxxing—burning tokens to fake productivity—is dying as individuals and companies wake up to AI costs. GitHub Copilot's shift to credit-based billing, along with reasoning models and agents, has drastically increased token consumption. AI providers are moving from growth-at-all-costs to profitability, leading to price hikes. Token optimization and accountability are now the norm.

Source: O'Reilly AI & ML Radar. Author: Mike Loukides

The practice of tokenmaxxing appears to be dying out, even before I had a chance to write about it. Good riddance. Burning tokens to create the appearance of productivity was fated to last only until the accountants learned about it, and the strictest of all accountants is one’s personal checkbook. What got many developers thinking about the cost of AI was the change in GitHub Copilot’s usage charges. The cost of Copilot went from a monthly fee with unlimited use to a monthly fee that purchased a limited number of credits, which are used to pay the AI provider of your choice. One credit is equivalent to US$0.01; when you’ve used up your credits, you can upgrade your account or pay for additional credits as you go.

The question isn’t why this didn’t happen earlier; it’s why this happened now. Tokenmaxxing is both the creation and victim of two large-scale trends in AI. First, starting with OpenAI, the major AI providers were all playing a blitzscaling game that prioritized user growth over profitability. Giving AI services away for free got you more users, and in the long run, scalers would figure out how to make money from end-user fees, selling user data, or advertising. This process inevitably ends in enshittification, and is still very much the road we’re on.

Second, token usage exploded late in 2025. The appearance of “reasoning models,” which use tokens to maintain an internal dialog in the course of solving a problem, increased the number of tokens used to respond to each prompt. Reasoning tokens are a model’s conversation with itself about possible responses to the prompt, and are often more numerous than the prompt and response themselves. Whether or not users see the reasoning process (often they don’t), reasoning tokens add to the bill. They are frequently counted as “output tokens” because they are generated by the model, and are more expensive than input tokens.
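To make the billing effect concrete, here is a minimal sketch of how reasoning tokens inflate the cost of a single request when they are charged at the output rate. The prices and token counts are hypothetical, chosen only for illustration; they are not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing, input_tokens, visible_output_tokens, reasoning_tokens=0):
    """Estimate the cost of one request, billing reasoning tokens as output tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


# Assumed rates: output tokens cost five times as much as input tokens.
pricing = Pricing(input_per_million=2.0, output_per_million=10.0)

# The same prompt and visible answer, without and with hidden reasoning.
plain = request_cost(pricing, input_tokens=1_000, visible_output_tokens=500)
reasoned = request_cost(pricing, input_tokens=1_000, visible_output_tokens=500,
                        reasoning_tokens=4_000)

print(f"without reasoning: ${plain:.4f}")
print(f"with reasoning:    ${reasoned:.4f}")
```

With these assumed numbers, the user sees the same 500-token answer in both cases, but the second request costs several times more because the 4,000 reasoning tokens are billed at the higher output rate.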

The appearance of agents also multiplied the rate at which users consumed tokens. In May 2025, Simon Willison quoted Anthropic’s Hannah Moran’s definition of an agent: “Agents are models using tools in a loop.” The Tredence blog writes: “The agent loop is a repeating cycle in which the AI reads the current data, thinks through what it means, chooses an action, carries it out, checks what happens and starts over.” If you’ve ever watched Claude Code, OpenClaw, or any other agent work, you’ve seen a single request become many calls to a model, each one using hundreds of tokens, if not thousands. In addition to the current request, each agent-generated invocation can contain the task’s entire accumulated context and relevant documents. Between reasoning tokens and agents, token usage can go up by a factor of hundreds.
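A rough sketch shows why resending accumulated context is so expensive. It assumes a simplified agent that sends its entire context on every call and adds a fixed number of tokens (tool results, intermediate output) after each step. The function name and all numbers are hypothetical, and real agents may trim or cache context.

```python
def agent_input_tokens(base_context, tokens_added_per_step, steps):
    """Return the input tokens sent on each call when every call resends the full context."""
    per_call = []
    context = base_context
    for _ in range(steps):
        per_call.append(context)          # this call sends everything accumulated so far
        context += tokens_added_per_step  # tool results and output grow the context
    return per_call


# Assumed task: a 2,000-token starting context, 1,500 tokens added per step, 10 steps.
calls = agent_input_tokens(base_context=2_000, tokens_added_per_step=1_500, steps=10)
total = sum(calls)

print(calls[0], calls[-1])  # first and last call sizes: 2000 15500
print(total)                # total input tokens across the loop: 87500
print(total / calls[0])     # multiple of a single one-shot request: 43.75
```

Because each call is larger than the one before it, total input tokens grow roughly with the square of the number of steps. Under these assumptions, a ten-step loop already uses more than forty times the input of a single request, before any reasoning tokens are counted.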

The increase in token usage might not be an issue if it results in problems being solved and tasks completed more effectively. But it collides with the loss-leader pricing of the blitzscalers; their willingness to operate at a loss to gain control of a market has limits. Regardless of whether the number of AI users is increasing, the amount of computation, and therefore cost, per user grows as the use of agents increases. Reasoning models increased token usage; agents compounded the problem; and that led to price increases.[1] Microsoft/GitHub doesn’t want to pay Copilot customers’ AI bills. We haven’t yet seen across-the-board price increases from the AI providers themselves. But we have seen GitHub’s token credits, and we have seen Anthropic and OpenAI price more capable models significantly higher than older or less capable models. Fable is twice as expensive as Opus 4.8, and while some writers have called this pricing “fantastic,” that’s probably because they were expecting an even greater increase. While Fable can delegate tasks to Anthropic’s less expensive models, most early users observe that with Fable, token use goes up rather than down. Anthropic’s switch to token-based billing for its agent SDK (currently on hold) is another signal that the days of inexpensive AI are coming to an end. OpenAI’s story is similar: GPT-5.5 costs twice as much as GPT-5.4 per million tokens.

It’s also important to take capacity into account. Huge data centers have been in the news, but those data centers haven’t been built yet. More important, the electrical infrastructure needed to support those data centers—transmission lines, generators—hasn’t been built either, and that’s not an investment over which AI companies have much control. They can build their own power generation facilities on a data center campus, but that’s a huge investment in technologies that they’re not familiar with. And even if you generate power locally, you need other kinds of infrastructure: rail for coal, pipelines for gas. This isn’t (yet) an essay about data center power consumption and its consequences, but it is another factor that limits increased token usage. We’ve seen Anthropic’s outages blamed on capacity, and Anthropic has responded by leasing unused data center capacity from SpaceX. But the other way to respond to increased demand that can’t be met by current capacity is to increase prices, limiting customers to those who can afford to pay. That increase is being noticed by managers, accountants, and independent developers.

Token optimization and accountability are the inevitable consequence of upward pressure on token price. One way to build accountability is through better governance, a route Bennie Haelen describes in “The Subsidy Ended: What Tool-Using Agents Actually Cost.” Better governance is achieved through building an observability layer that lets you see exactly what the agents and models are doing. With a well-designed observability layer, you can see whether the data sent to the model is growing with each invocation, whether the model is using appropriate tools, whether tools are being called repeatedly, and a lot of other information that will tell you whether your agent is running efficiently.
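As an illustration of what such a layer might check, here is a minimal sketch that inspects a log of agent invocations for two of the warning signs mentioned above: input that grows with every call, and identical tool calls that repeat. The `Invocation` record, its fields, the function names, and the sample log are all hypothetical; this is not the design Haelen describes, only one simple way to express the idea.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invocation:
    """One logged model call (hypothetical schema)."""
    step: int
    input_tokens: int
    tool: Optional[str] = None
    tool_args: Optional[str] = None


def context_grows_every_step(invocations):
    """True if each call sent strictly more input tokens than the one before it."""
    sizes = [inv.input_tokens for inv in invocations]
    return len(sizes) > 1 and all(later > earlier
                                  for earlier, later in zip(sizes, sizes[1:]))


def repeated_tool_calls(invocations, threshold=2):
    """Return (tool, args) pairs that were called at least `threshold` times."""
    counts = Counter((inv.tool, inv.tool_args)
                     for inv in invocations if inv.tool is not None)
    return {call: n for call, n in counts.items() if n >= threshold}


# Made-up log of a four-step agent run.
log = [
    Invocation(1, 2_000, "search", "q=token pricing"),
    Invocation(2, 3_600, "search", "q=token pricing"),
    Invocation(3, 5_100, "read_file", "notes.md"),
    Invocation(4, 6_900, "search", "q=token pricing"),
]

print(context_grows_every_step(log))
print(repeated_tool_calls(log))
```

In this made-up run, both checks fire: the context grows on every call, and the same search is issued three times. Either signal would be a reason to look at how the agent manages its context and whether it is caching tool results.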

Another piece of token accountability is understanding which models are running your agent’s requests. General-purpose reasoning models range from expensive high-performance models like Claude Fable or Opus 4.8 to models like Gemma 4 26B that can run on a well-equipped laptop, and some models that are even smaller. While it’s tempting to say “I need the best; I’ll run Opus 4.8 or Fable with maximum reasoning,” most requests don’t require that level of reasoning or expense. Agents will be able to decide what model is best for processing every request. Fable can delegate, and we expect other frontier providers to follow as models incorporate agent capabilities. And there’s an active world of open models outside of the frontier AI providers. Vicki Boykis writes that models running locally now work almost as well as frontier models. Tools like OpenRouter give you a model-independent way of routing requests to different models, including open models that run locally. OpenRouter can be integrated with OpenClaw, Claude Code, Cursor, Codex, and other agents to provide intelligent routing.
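The routing idea can be sketched in a few lines: pick the cheapest model that is capable enough for the request. The model names, capability tiers, and prices below are hypothetical placeholders, and this is not OpenRouter's API or algorithm. Deciding how much capability a request needs is the hard part, and it is simply taken as an input here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capability: int      # hypothetical tier: 1 (basic) to 3 (frontier)
    output_price: float  # hypothetical US dollars per million output tokens


# Placeholder catalog, not real models or prices.
MODELS = [
    Model("small-local", capability=1, output_price=0.0),
    Model("mid-tier", capability=2, output_price=5.0),
    Model("frontier", capability=3, output_price=50.0),
]


def route(required_capability, models=MODELS):
    """Return the cheapest model whose capability meets the requirement."""
    candidates = [m for m in models if m.capability >= required_capability]
    if not candidates:
        raise ValueError(f"no model offers capability {required_capability}")
    return min(candidates, key=lambda m: m.output_price)


print(route(1).name)  # small-local
print(route(2).name)  # mid-tier
print(route(3).name)  # frontier
```

The design choice is to treat the frontier model as a last resort rather than the default: a request only reaches it when nothing cheaper qualifies.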

Tokenmaxxing is dying. It will no doubt take time for its vestiges to die away, and there will always be developers who think they can game the path to a promotion, along with managers who insist on being “all in” with AI. But spending tokens responsibly is now the norm, whether you pay with your own checkbook or a company account. Token optimization will only become more important as per-token charges increase. They undoubtedly will.

Footnotes

[1] Some articles make the strange claim that tokens have gotten cheaper by up to 98%. GPT-5.5 suggests that these writers are considering the work that can be done per token. That comparison may be worthwhile, though it’s unclear how to compare GPT-3 with 5.5 or Fable meaningfully. For this article, a token is a token.