2026-05-24 08:06 UTC | Updated: 2026-06-30 13:03 UTC

Researchers let Claude Code discover AI scaling algorithms that humans probably wouldn't have designed

Researchers from UMD, Google, Meta, and other institutions use AutoTTS to let a coding agent independently discover control algorithms for AI reasoning. The algorithm it found cuts compute by about 70 percent compared to standard self-consistency while matching its accuracy. The whole search cost $40 and took 160 minutes.

Source: The Decoder | Author: Jonathan Kemper

Instead of writing rules for more efficient AI reasoning themselves, researchers let a coding agent hunt for better control algorithms in a simulated environment. The result beats established methods while burning far less compute.

Test-time scaling (TTS) is meant to make large language models perform better by letting them spend more compute on a response, say, by running several solution paths in parallel or extending chains of thought. Until now, human-written rules almost always dictated when a model kicks off a new solution path, doubles down on a promising one, or kills it.

A research team from UMD, UVA, WUSTL, UNC, Google, and Meta flips that with AutoTTS. Humans don't write the algorithm. Instead, they build the playground where an AI agent figures out algorithms on its own.

The paper argues that many known methods are really just special cases in a shared control space defined by width (how many solution paths run at once) and depth (how far each one goes). So why, the authors ask, do researchers keep plotting paths through this space by hand instead of letting a machine search it?

Simulating the search keeps costs down

At the core of AutoTTS sits an offline environment. For each task, the team pre-generates several solution paths from the language model and stores them. A new control algorithm decides how to spend compute based on data that's already there. That way, thousands of variants can run without firing up the actual language model each time.
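To make the offline idea concrete, here is a minimal sketch of a replay environment. It is not the paper's code: the class `OfflineEnv`, its methods, the `fixed_width_depth` controller, and the stored token counts are all hypothetical names and values chosen for illustration. The point is only that a controller can widen (open a stored path) or deepen (reveal the next stored step) while the environment charges tokens, with no model call involved.

```python
from dataclasses import dataclass, field


@dataclass
class OfflineEnv:
    """Replays pre-generated solution paths instead of calling a model.

    paths[i] is a list of (tokens_in_step, interim_answer) tuples for path i.
    All names here are illustrative assumptions, not taken from the paper.
    """
    paths: list
    depth: dict = field(default_factory=dict)  # path index -> steps revealed
    tokens_used: int = 0

    def open_path(self):
        """Widen: start replaying the next stored path; None if none are left."""
        if len(self.depth) >= len(self.paths):
            return None
        idx = len(self.depth)
        self.depth[idx] = 0
        return idx

    def extend(self, idx):
        """Deepen: reveal one more stored step and charge its tokens."""
        step = self.depth[idx]
        if step >= len(self.paths[idx]):
            return False  # path already fully replayed
        self.tokens_used += self.paths[idx][step][0]
        self.depth[idx] = step + 1
        return True

    def answer(self, idx):
        """Latest interim answer revealed so far (None if nothing revealed)."""
        step = self.depth[idx]
        return self.paths[idx][step - 1][1] if step > 0 else None


def fixed_width_depth(env, width, depth):
    """Trivial controller: open `width` paths and extend each `depth` steps."""
    answers = []
    for _ in range(width):
        idx = env.open_path()
        if idx is None:
            break
        for _ in range(depth):
            if not env.extend(idx):
                break
        answers.append(env.answer(idx))
    return answers


# Made-up stored data: three paths with per-step token costs.
stored = [
    [(100, None), (120, "42")],
    [(90, None), (110, "41")],
    [(80, "42")],
]
env = OfflineEnv(stored)
print(fixed_width_depth(env, width=2, depth=2), env.tokens_used)
```

Because every call only reads stored data, scoring a new controller costs CPU time rather than model inference, which is what lets a search try many variants cheaply.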

Figure: AutoTTS moves the human role from algorithm design to environment design: instead of defining branching, pruning, and stopping rules, the researchers define states, actions, and feedback. An agent then searches for a controller on its own inside that environment. | Image: Zheng et al.

Claude Code does the searching. Over several rounds, the agent reviews what came before, spots weaknesses in earlier proposals, and writes a new control algorithm directly in code. To stop the search from getting lost in thousands of tiny knobs, each proposal can only expose one high-level controller to the outside. That controller sets all the other thresholds on its own. Full logs from each run also show the agent where earlier attempts blew compute for nothing.
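The single-controller constraint can be pictured as one exposed knob from which every internal threshold is derived. The sketch below is an assumption about what such an interface might look like; the names `Thresholds` and `derive_thresholds`, the three thresholds, and the formulas are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_width: int          # most paths allowed in parallel
    stop_confidence: float  # majority share at which the search stops
    patience: int           # rounds a diverging path survives before pruning


def derive_thresholds(budget: float) -> Thresholds:
    """Map one knob in [0, 1] to all internal thresholds.

    The formulas are made up; only the shape (one input, many derived
    settings) reflects the constraint described in the article.
    """
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be in [0, 1]")
    return Thresholds(
        max_width=4 + round(60 * budget),
        stop_confidence=0.6 + 0.3 * budget,
        patience=1 + round(3 * budget),
    )


lean = derive_thresholds(0.0)    # cheapest setting
full = derive_thresholds(1.0)    # most thorough setting
print(lean)
print(full)
```

With only `budget` visible from outside, a search cannot overfit by tuning each threshold separately, which matches the motivation the article gives for the restriction.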

Figure: Many established test-time scaling methods map to different paths through the same control space of width and depth. AutoTTS searches for new paths in exactly this space. | Image: Zheng et al.

Agent-written algorithm outperforms human-designed ones

On math benchmarks like AIME and HMMT, the algorithm the agent came up with gets better accuracy per unit of compute than established methods. The lean setting slashes token usage by about 70 percent compared to standard self-consistency, which just generates 64 answers in parallel and picks the winner by majority vote. Accuracy holds steady.
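For reference, the self-consistency baseline and the savings figure can be written out in a few lines. This is a generic sketch of majority voting, not the paper's evaluation code; the function names, the sample answers, and the token counts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most common answer and its share of the votes."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def relative_savings(baseline_tokens, method_tokens):
    """Fraction of the baseline's tokens that the method avoids."""
    return 1 - method_tokens / baseline_tokens


# 64 sampled answers, as in the baseline described above (values made up).
sampled = ["42"] * 40 + ["41"] * 24
print(self_consistency(sampled))

# Made-up token counts: 64 paths of 1,000 tokens vs. a 19,200-token run.
print(round(relative_savings(64 * 1000, 19_200), 2))
```

The baseline always pays for all 64 paths up front, which is why a controller that stops or prunes early has so much room to save tokens.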

The algorithm also carries over to a different model (DeepSeek-R1-Distill-Llama-8B) and a non-math benchmark (GPQA-Diamond). The whole discovery run cost about $40 and took 160 minutes.

Figure: Across four model sizes and two math benchmarks, the algorithm AutoTTS found delivers better or comparable accuracy at lower token usage than hand-written methods. | Image: Zheng et al.

A logic humans probably wouldn't have come up with

More interesting than the raw numbers is how the discovered program actually works. It tracks how the model's confidence shifts over several rounds. Other methods bail out the moment a majority among answers tips over.

If confidence barely budges, the algorithm opens more solution paths. If it climbs quickly, it skips new ones. Solution paths whose interim result lines up with the current majority get extra compute. The algorithm only drops paths that diverge if they keep heading the wrong way over multiple rounds.
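The four rules above can be sketched as a per-round decision function. This is a reading of the article's description, not the discovered program itself: `ControllerState`, `plan_round`, and the parameters `stall_delta`, `patience`, and `new_paths` are hypothetical, and confidence is approximated here as the majority's vote share, which is an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ControllerState:
    confidence_history: list = field(default_factory=list)
    off_majority_rounds: dict = field(default_factory=dict)  # path id -> streak


def plan_round(state, interim_answers, stall_delta=0.02, patience=3, new_paths=2):
    """Decide the next round from each live path's interim answer.

    interim_answers maps path id -> current interim answer.
    Returns (number_of_paths_to_open, paths_to_extend, paths_to_drop).
    """
    majority, votes = Counter(interim_answers.values()).most_common(1)[0]
    confidence = votes / len(interim_answers)  # assumed proxy: majority share

    previous = state.confidence_history[-1] if state.confidence_history else None
    state.confidence_history.append(confidence)
    # "Barely budges": small absolute change since the last round. Falling
    # confidence is not covered by the article; this sketch opens nothing then.
    stalled = previous is not None and abs(confidence - previous) <= stall_delta

    extend, drop = [], []
    for pid, answer in interim_answers.items():
        if answer == majority:
            state.off_majority_rounds[pid] = 0
            extend.append(pid)  # majority-aligned paths get extra compute
        else:
            streak = state.off_majority_rounds.get(pid, 0) + 1
            state.off_majority_rounds[pid] = streak
            if streak >= patience:  # drop only after repeated divergence
                drop.append(pid)
    return (new_paths if stalled else 0), extend, drop


state = ControllerState()
print(plan_round(state, {0: "42", 1: "42", 2: "41"}))
print(plan_round(state, {0: "42", 1: "42", 2: "41"}))
print(plan_round(state, {0: "42", 1: "42", 2: "41", 3: "42", 4: "42"}))
```

In this toy run the second round opens new paths because confidence did not move, and the third round opens none because it jumped, while path 2 is dropped only after disagreeing with the majority three rounds in a row.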

The authors call this kind of coordination something that would've been nearly impossible to design by hand. An ablation study shows how much depends on two design choices: drop the single high-level controller, and the agent falls back on extreme shortcuts that save tons of compute in testing but tank accuracy on new tasks. Without detailed logs, the discovered algorithm eats more compute at worse accuracy, so a bare final result just isn't enough to figure out what went wrong.

From writing algorithms to building search spaces

The authors put AutoTTS in a line with work like FunSearch, AlphaEvolve, and ADAS, all of which use language models as program searchers. What's new here is applying that idea to test-time scaling, which was mostly done by hand before.

The current version only covers the trade-off between width and depth. It can't handle more complex structures like tree searches. How good the discovery turns out also depends on the coding agent. The authors don't say whether open-source alternatives would work just as well.

The bigger takeaway is that the work shifts where humans come in: instead of inventing the rules themselves, researchers set up the search environment those rules live in. The actual strategy then emerges as code that a language model writes and refines.

As early as 2024, researchers from Hugging Face showed that small language models can match much larger ones through smart test-time compute scaling, though with search strategies designed by hand. Meta and partners recently introduced hyperagents, AI systems that optimize their own improvement process.