Updated: 2026-06-17

The Economics of AI Reasoning

Since OpenAI released the first reasoning model o1 in 2024, reasoning capabilities have quickly become standard in AI models. However, reasoning consumes significant computational resources; test-time compute can improve accuracy but drastically increases costs. This article analyzes the types of reasoning, its use cases, and its impact on performance and cost, concluding that disabling reasoning for simple tasks can substantially reduce costs and improve speed.

Source: Cerebras Blog

Jun 17 2026

Sherif Cherfa

In 2024, o1, the world's first reasoning model, was released to the public by OpenAI, shortly followed by DeepSeek-R1 and o3.

By 2025, almost every single model, regardless of where it had been developed or whether it was open or not, had support for reasoning.

Reasoning, "thinking," "curfuffling," and "cooking" were a force multiplier early on, when LLMs were still discovering tool use. o1 and o3 were capable of crunching on problems for hours before returning mostly accurate responses.

The technical term is "test-time compute": essentially, spending more compute at inference time to increase the accuracy of the output. The model produces tokens arguing with and questioning itself before giving the user an answer.

During this same period, benchmarks started saturating, AI got better at structured outputs ("tool calling"), and the frontier shifted away from chat-like experiences and toward agents that act on your behalf.

Agentic trajectories reward a model's ability to chain tool calls quickly and precisely, and sometimes reasoning gets in the way of that. The longer a model thinks, the less context it has left for tool calls before it has to compact.
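To make the context trade-off concrete, here is a minimal sketch. It assumes reasoning tokens stay in the context window alongside tool-call tokens (some setups discard earlier reasoning, in which case the effect is smaller). The function name and every number are hypothetical, chosen only for illustration.

```python
def steps_before_compaction(context_window: int,
                            tool_tokens_per_step: int,
                            reasoning_tokens_per_step: int) -> int:
    """Count how many agent steps fit in the context window before compaction.

    Assumption: each step appends both its tool-call tokens and its
    reasoning tokens to the context, and nothing is dropped in between.
    """
    per_step = tool_tokens_per_step + reasoning_tokens_per_step
    if per_step <= 0:
        raise ValueError("each step must consume at least one token")
    return context_window // per_step


# Hypothetical figures: a 128k window and 1k tokens per tool call.
without_reasoning = steps_before_compaction(128_000, 1_000, 0)
with_reasoning = steps_before_compaction(128_000, 1_000, 3_000)
print(without_reasoning, with_reasoning)
```

Under these made-up numbers, spending 3,000 reasoning tokens per step cuts the number of steps that fit before compaction to a quarter.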

Given how much progress we made in a single year, determining the cost of reasoning and its effect on performance isn't straightforward.

From the data we can see:

6x more tokens spent for technical prompts

7-11x longer completion times

10-20% improvement for enabling reasoning

What are the different types of reasoning?

Interleaved thinking: this is the current standard. The LLM will think between tool calls, and decide how to act next, weighing the history of the context with the next decision to make.

Adaptive reasoning: the model is trained to decide on its own how much to reason.

Configurable reasoning: the user can select between different levels (low, medium, high).
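As a rough illustration of how adaptive and configurable reasoning might surface in client code, the sketch below models the effort level as an enum and attaches it to a request payload. The `Effort` type, the `build_request` helper, and the `reasoning_effort` field name are assumptions made for this example. Real providers use different parameter names and values, so check the API you are calling.

```python
from enum import Enum
from typing import Any


class Effort(Enum):
    AUTO = "auto"      # adaptive: let the model decide how much to reason
    LOW = "low"        # configurable levels chosen by the user
    MEDIUM = "medium"
    HIGH = "high"


def build_request(prompt: str, effort: Effort = Effort.AUTO) -> dict[str, Any]:
    """Build a hypothetical request payload with an optional effort setting."""
    request: dict[str, Any] = {"prompt": prompt}
    # Only send an explicit level when the user picked one; otherwise the
    # (assumed) default is for the model to choose its own budget.
    if effort is not Effort.AUTO:
        request["reasoning_effort"] = effort.value
    return request


print(build_request("Rename this variable", Effort.LOW))
```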

What is reasoning good for?

An LLM can increase its own accuracy by using more compute during runtime. It does so by generating tokens "exploring" a topic. For example, you might see something akin to an inner dialogue where the model doubts itself, considers alternatives, and even repeats ideas of a plan back to itself. This is an attempt to construct its own context window to make success more likely.

This is also an opportunity for us to teach it how to intertwine partially related concepts like Socratic questioning, double-checking its own work, or planning ahead.

All of this is tremendously valuable for tasks that require a string of precise steps:

Complex single shot challenges

Puzzles, math, logic

Hitting benchmark targets

There's no doubt that increasing thinking budgets improves model performance. Here's a perfect example: there is a ~10% difference between GPT-5.5 (xHigh) and GPT-5.5 (low), and another 10% difference between GPT-5.5 (low) and GPT-5.5 (no reasoning).

This also applies to smaller, open-weight models. For example, here are Qwen-3.6-27B and Gemma-4-31B with reasoning, both beating last year's SOTA Sonnet-4 (with reasoning).

On average, max uncapped reasoning increases performance on coding and agentic benchmarks by ~20%; however, you'd have to spend about 5-10x more output tokens (expensive!) than you would with reasoning off.

How much do most sessions benefit from reasoning?

An analysis of 1000+ of my AI sessions with Codex, Claude Code, Droid, and Pi agents shows that about half of my prompts were incredibly simple and required no reasoning or complex intelligence to complete.

Most of your prompts probably look like:

"Find and open x file in the application"

"Check my email & calendar for what I need to do today"

"Locate processes which are on my machine"

"Clone and analyse this github repository"

"Download and configure x resource"

"Change this media to another format"
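One way to act on this observation is to route prompts like the ones above to a no-reasoning setting before they reach the model. The sketch below is a deliberately naive keyword heuristic, not a method from the analysis. The pattern lists, the word limit, and the `pick_reasoning` name are all hypothetical, and a real router would need tuning against your own sessions.

```python
import re

# Hypothetical verbs that tend to signal simple, mechanical tasks.
SIMPLE = re.compile(r"\b(find|open|locate|list|check|clone|download|convert|change)\b")
# Hypothetical verbs that tend to signal multi-step reasoning.
COMPLEX = re.compile(r"\b(prove|debug|design|optimi[sz]e|refactor|derive)\b")


def pick_reasoning(prompt: str, max_simple_words: int = 20) -> str:
    """Return "off", "low", or "high" based on a crude keyword match."""
    text = prompt.lower()
    if COMPLEX.search(text):
        return "high"
    if SIMPLE.search(text) and len(text.split()) <= max_simple_words:
        return "off"
    return "low"  # fall back to a cheap setting when unsure


for p in ["Find and open x file in the application",
          "Prove that this scheduler is deadlock-free"]:
    print(p, "->", pick_reasoning(p))
```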

Less Reasoning, More Speed

It’s time to look at reasoning as a speed control or a cost control toggle. Anthropic and OpenAI both charge 2x for 1.5-2.5x speedups on their models. What if we could go 7x faster just by decreasing raw intelligence by 20%?

For a model like Qwen3.6-27B, 87.5%+ of generated tokens are reasoning tokens. That means I would have to pay 7x more than necessary half the time. Additionally, in memory-constrained environments we get less KV cache, which means more compaction, and each compaction reduces performance significantly.

By disabling reasoning we can expect our agents to run for much longer before compacting, and our bill to be 85% cheaper.

For time-sensitive work, like fetching files, finding issues, responding to incidents, making incremental updates, and using AI as a component of a system, it often makes no sense to pay such a heavy toll for a 20% boost.
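The arithmetic behind these figures can be sketched as follows. It assumes the answer tokens stay the same when reasoning is turned off, that cost scales linearly with output tokens, and that every prompt uses the same number of tokens. These are simplifications, and the function names are made up for this example.

```python
def reasoning_multiplier(reasoning_share: float) -> float:
    """Total output tokens relative to a run that emits answer tokens only."""
    if not 0.0 <= reasoning_share < 1.0:
        raise ValueError("reasoning_share must be in [0, 1)")
    return 1.0 / (1.0 - reasoning_share)


def blended_savings(reasoning_share: float, simple_fraction: float) -> float:
    """Fraction of output tokens saved by disabling reasoning only on the
    simple prompts, assuming every prompt uses the same number of tokens."""
    return simple_fraction * reasoning_share


share = 0.875  # reasoning share of output tokens quoted above
print(reasoning_multiplier(share))   # total tokens vs. answer-only tokens
print(blended_savings(share, 1.0))   # reasoning off for every prompt
print(blended_savings(share, 0.5))   # reasoning off for the simple half only
```

With an 87.5% reasoning share, a reasoning run emits 8x the tokens of an answer-only run, which is 7x extra. Turning reasoning off everywhere saves 87.5% of output tokens, in line with the roughly 85% figure, while turning it off only for the simple half of prompts saves about 44% under these assumptions.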

I recommend checking this article for some valuable information on GPT-5.5’s test-time compute.