AI News HubLIVE
站内改写5 min read

Linear Thinking, Nonlinear Costs

Coding agents simplify building AI agent workflows but obscure nonlinear cost scaling. Classical CS optimizations like memoization, pruning, and dynamic programming are essential to avoid repeated work and high costs.

SourceO'Reilly AI & ML RadarAuthor: Nicole Koenigstein

Many AI agent systems become economically unsustainable long before they become technically impressive. Teams usually focus on model choice, prompt design, tool calling, and orchestration. Those things matter, but they are only part of the system setup. The deeper issue is that coding agents, such as Claude Code, Codex, and Jules, make agent workflows easier to generate. But when implementation is abstracted away, the underlying mechanics become harder to see. Bad engineering used to produce slow code. Now it produces expensive systems that also happen to be slow.

When we design agent systems, we still need to remember that the costs scale nonlinearly. A single user request rarely triggers a single model call. It expands into routing, retrieval, reasoning, reflection, guardrail checks, tool calls, and synthesis. Each step may repeat shared context, reload state, recompute a planner decision, or retry a failed path. What looks like an intelligent workflow can therefore behave like a recursive, stateful computation with overlapping subproblems. If that sounds like backtracking, dynamic programming, and memoization to you, you’re right.

We already know how to optimize systems like this. The problem is that coding agents make agent systems easier to generate, but not necessarily easier to optimize. Unless we recognize the underlying mechanics, we may never ask our coding agents to apply the optimization patterns that keep our systems viable.

Old problems wearing new clothes

When we use coding agents to generate agent architectures, it’s tempting to stop at “the trace looks reasonable.” The tool can generate routers, retrievers, planners, evaluators, guardrails, tool interfaces, and synthesis steps. It may also know about caching, pruning, memoization, and state modeling. But it won’t necessarily implement those patterns unless you ask for these optimization layers explicitly.

Even if you work with agent instructions, unless your SKILL.md, AGENTS.md, or project instructions include constraints around repeated context, memoization, cache invalidation, pruning, and cost per request, your resulting agent system may be functionally correct and economically wasteful at the same time. That’s the tricky part: The code can pass review, the unit tests can pass, and the architecture can look reasonable. The invoice is where the hidden computation finally shows up.

It’s easy to give too much agency to tools like Claude Code. When a coding agent reasons in language, calls tools, reflects, and produces fluent text or code, it can feel like a knowledgeable coworker. At the interface level, that impression is understandable. These tools help teams generate more code, move faster, and become more productive. Still, this doesn’t remove the need for engineering craft underneath. Someone still has to recognize repeated context, recomputed planner decisions, correlated retries, unpruned branches, and state that can’t be reused. The coding agent can implement the system, but the engineer still has to understand what kind of system should be implemented. This is where old computer science returns, not as theory but as the optimization layer our agent systems need in production.

The cost multiplier, repeated-work problems, and backtracking

The cost multiplier often shows up first as latency. The user doesn’t see the router, the retries, the reflection loop, or the tool calls. They only see that the agent is taking too long. From the outside, the system looks stuck or broken. From the inside, it may simply be repeating work.

This is one of the uncomfortable differences between traditional software and agent systems. In a conventional application, a failed operation often throws an error, times out, or leaves a trace that is easy to inspect. In an agent workflow, failure can look like effort to improve reliability. Take the weakest step in your agent workflow. If it succeeds 60% of the time, and you try to push it close to 99% reliability through retries, you need 5 retries:

1 − (1 − 0.60)5 = 0.98976

This math assumes each retry is a roll of fair dice. LLMs aren’t dice. Whether you’re using greedy decoding or probabilistic sampling, the model is still drawing from the same underlying distribution shaped by your prompt. If the first “thought” is a hallucination or logic error, bumping the temperature won’t fix the underlying state. You aren’t buying independent trials; you’re just sampling different paths through the same flawed map and state.

This is where the old algorithmic framing matters. In a backtracking problem, you don’t keep walking down the same failed branch and call it progress. You return to the last valid state, mark the failed path, and use the failure as information for the next choice. The point isn’t just to try again. The point is to try again under a changed state.

Agent workflows need the same discipline. A retry shouldn’t mean “run it again and hope.” It should give the model structured feedback about why the previous attempt failed: which constraint failed, which tool result was invalid, which schema didn’t validate, which assumption was unsupported, or which branch added nothing. The next attempt should then change something meaningful: the prompt, the tool choice, the retrieved evidence, the validation constraint, or the planner state.

Memoization, pruning, and dynamic programming

Prompt caching is usually the first optimization. If every step repeats the same system prompt, tool definitions, schema constraints, examples, and policy rules, then caching the shared prefix is an obvious win. It reduces the cost of repeated context. But prompt caching only recognizes that text repeats. It doesn’t notice that decisions repeat.

In many agent systems, the expensive unit isn’t only text. It’s the repeated decision. If the same or equivalent state appears again, paying the model to rediscover the same action is unnecessary. That is what memoization does: It turns repeated computation into lookup. In classical algorithms, the repeated computation might be a recursive subproblem. In an agent system, it might be a planner decision over the same task, facts, tools, and constraints. The planner can be treated as a function over state:

πLLM(St)→at+1^πLLM(S_t) \rightarrow a_{t+1}

where StS_t is the current state of the workflow and at+1a_{t+1} is the next action. Without memoization, this function is evaluated again and again through an LLM call. With memoization, the system first checks whether it has seen the same or equivalent state before. If you want a deeper walkthrough of how to use memoization, I cover it in AI Agents: The Definitive Guide.

But memoization only helps once the system knows which states are worth revisiting. Pruning handles the other side of the problem: branches that shouldn’t be explored further. However, don’t limit pruning to KV cache pruning or speculative decoding. Use it also when a tool repeatedly returns no new information. Your next LLM call shouldn’t be a slightly reworded version of the same query. If a reflection loop keeps producing stylistic changes without improving correctness, the loop should stop. If a search path violates a constraint or depends on an unsupported assumption, it should be marked as unproductive and removed from the active search space.

Dynamic programming becomes relevant when different branches of the workflow solve overlapping subproblems. A research agent may ask similar questions across several documents. A coding agent may inspect the same dependency chain from different entry points. A business analysis agent may compute the same metric for several report sections. If every branch solves these subproblems from scratch, the system pays repeatedly for work it has already done. Table 1 shows examples of how these patterns map to AI agent systems.

Table 1. Classical optimization patterns applied to AI agent systems

OptimizationThe “old” CS wayThe “agent” way

MemoizationStore results of expensive function calls.Cache decisions. If the agent saw this state before, don’t ask it to reason again.

PruningCut off search paths in a tree that won’t lead to a solution.Kill a reflection loop when the critique stops yielding structural improvements.

Dynamic programmingBreak problems into overlapping subproblems. Share codebase analysis across multiple specialized agents instead of rereading files.

This isn’t nostalgia. These patterns mitigate the cost structure of agent systems. Memoization reduces repeated decisions. Pruning reduces repeated failure. Dynamic programming reduces repeated subproblem solving. Together, they form the optimization layer many agent architectures are missing in production.

Where to start: Optimization follows topology

The patterns above aren’t a checklist you apply uniformly. Each multi-agent topology, whether centralized, decentralized, independent, or hybrid, distributes communication and coordination differently, which directly affects overhead, latency, and failure propagation. The optimization layer has to follow.

Centralized A single orchestrator decides, delegates, and aggregates. The expensive unit is the orchestrator’s decision, repeated across similar inputs. Memoize the planner first.

Decentralized Agents coordinate peer-to-peer, exchanging messages without a central authority. The cost moves into the communication itself: redundant exchanges, restated context, agents reasoning over the same shared state from different angles. Prompt caching on the shared context is the first win, followed by pruning exchanges that no longer add information.

Independent/swarms Lightweight agents fan out without coordinating. Cheap individually, expensive in aggregate. If three of your ten agents ask semantically equivalent questions, you pay three times for the same answer. Memoization and pruning aren’t optimizations here; they’re load-bearing.

Hybrid The repeated work shows up at two scales: within a cluster (overlapping subproblems among peers) and across clusters (the coordinator rediscovering the same routing decision). Use dynamic programming on shared subproblems inside the cluster, memoization on the coordinator’s decisions across them.

The optimization layer isn’t a generic discipline you bolt on. It’s a function of the shape of the implementation. Coding agents made it easy to generate the shape without seeing it. The craft is in seeing it anyway.