Agent Execution Tax
A benchmark of 720 browser agent tasks reveals that structured output reliability, not raw intelligence, is the bottleneck in agentic AI. Gemini 2.5 Flash incurred a 22.9% execution tax due to malformed JSON, while Kimi K2.5 had zero. This tax compounds into higher latency, cost, and failure rates. The report introduces Reliability-Adjusted Accuracy and cost-per-successful-task metrics.
Agents Don't Fail on Intelligence. They Fail on Execution.
PUBLISHED 5/20/2026
Table of Contents
What 720 browser agent runs revealed about the real bottleneck in agentic AI.
Deployment Readiness Scorecard
The Agent Execution Tax
Definition
Applied to Our Data
How the Tax Compounds
Generalised Formulas
Structured Output Reliability: The Root Cause
The Data
Why This Matters More Than You Think
Reliability-Adjusted Accuracy
Why Nobody Measures This
What This Looks Like in Practice
How We Measured This
The Setup
The Models
Scope of This Benchmark
Cost Per Successful Task > Token Pricing
Inference Latency: The Compounding Story
What the Fireworks Serving Layer Contributes
Per-Site Analysis: Where the Thesis Holds or Breaks
Universal Success
Universal Failure
The Differentiators
Gemini's One Win: Google Flights
What This Means for AI Procurement
Procurement Scorecard
Model Profiles: Three Models, Three Strategies
GLM-5: The Reasoning Powerhouse
MiniMax M2.5: The Best Value
Kimi K2.5: Fastest Inference in This Benchmark
The Vision Question
Kimi K2.5 Vision: Infrastructure Constraint, Not Model Limitation
Closing
Appendix
A. Reproducibility
B. Benchmark Configuration
C. Evaluator Methodology
D. Full Per-Site Breakdown (Numeric Reference)
E. Data Files
What 720 browser agent runs revealed about the real bottleneck in agentic AI.
A Notte × Fireworks AI benchmark report.
Foundation models keep getting smarter. They ace reasoning benchmarks, write fluent code, and pass professional exams. Yet when you put them inside an agent loop, where they must observe a webpage, decide what to do, and output a structured action ten times in a row, they fail roughly half the time.
We ran 720 browser automation tasks across four LLMs to find out why. The answer was not intelligence. It was execution: one model wasted nearly 1 in 5 LLM calls on malformed JSON that had to be retried. That single reliability gap cascaded into higher latency, inflated cost, and lower task success, even though the model's raw reasoning capability was competitive.
We call this overhead the Agent Execution Tax: the ratio of wasted inference to productive inference. For the worst-performing model in our benchmark, that tax was 22.9%. For the best, it was zero.
In agent systems, reliability compounds harder than intelligence. The models that won were not the ones with the best reasoning scores. They were the ones that reliably did what they were told, every time, in the format they were asked for.
In production, that reliability is shaped not just by the model itself, but by the inference infrastructure serving it: structured output consistency, latency predictability, and stable execution under repeated agent loops.
At 10,000 agent tasks per day, a modest production volume, the execution overhead of the worst-performing model costs over $40,000 per year in inference that produces no value. A model that looks cheaper per token can cost significantly more per outcome once retries, failures, and inflated call counts are factored in.
Scope. This is a text-only browser agent benchmark. Results measure structured output reliability and step efficiency in a multi-step agent loop — not general model intelligence, reasoning ability, or multimodal capability. See Scope of This Benchmark below for the full scope statement.
Deployment Readiness Scorecard
If you are evaluating models for an agent deployment, here is how they map to production constraints.
| If you need... | Use | Why |
| --- | --- | --- |
| Maximum task accuracy | GLM-5 | 57.1% accuracy; 100% on Google Maps, HuggingFace, BBC News, Wolfram Alpha; strongest on structured data extraction and multi-step reasoning |
| Lowest cost at scale | MiniMax M2.5 | $0.062 per successful task (2.3x cheaper than Gemini); RL-trained agent that takes the fewest steps (9.8 avg) and rarely retries (1.6%) |
| Fastest real-time response | Kimi K2.5 | 2.1s p50 LLM latency; zero parse retries across 852 calls; best for user-facing agents where perceived speed matters |
| Rigorous procurement evaluation | Reliability-Adjusted Accuracy | Token pricing misleads at the model selection stage; cost per successful task and execution tax are the metrics that reflect what you actually pay for |
One-line summary per model:
• GLM-5: Best accuracy, highest cost. Use for compliance workflows, research automation, and tasks where errors carry downstream consequences.
• MiniMax M2.5: Best value. Default choice for scaled production workloads. The $40k/year waste calculation makes it the economically dominant option at volume.
• Kimi K2.5: Best speed, zero execution overhead. Use for customer-facing agents, live demos, and any workflow where response latency affects user trust.
The Agent Execution Tax
A browser agent task looks simple from the outside: go to Amazon, search for a product, extract the price. Under the hood, it is a multi-step loop:
observe page → LLM generates action (as JSON) → execute action → observe new page → repeat
A typical task takes 10 steps. Each step is an LLM call that must return valid structured output: a JSON object specifying which element to click, what text to type, or what data to extract. If the JSON is malformed, the framework retries. And that retry is invisible: it does not show up in task success rates or reasoning benchmarks. It only surfaces as inflated call counts, latency, and cost once you instrument the engine itself.
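To make the invisible retry concrete, here is a minimal sketch of a parse-retry wrapper that counts every call. The names `CallStats`, `call_with_parse_retries`, and `make_scripted_llm` are hypothetical and are not taken from the benchmark's framework; a real engine would likely also validate the action against a schema rather than only checking that it parses as a JSON object.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CallStats:
    total_calls: int = 0    # every LLM round-trip, including retries
    parse_retries: int = 0  # failed parses that triggered another attempt


def call_with_parse_retries(
    llm_call: Callable[[str], str],
    prompt: str,
    stats: CallStats,
    max_retries: int = 3,
) -> dict:
    """Call the model until it returns a JSON object, counting each attempt."""
    for attempt in range(max_retries + 1):
        stats.total_calls += 1
        raw = llm_call(prompt)
        try:
            action = json.loads(raw)
            if isinstance(action, dict):
                return action
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through and retry
        if attempt < max_retries:
            stats.parse_retries += 1
    raise ValueError("parse failure: retries exhausted")


def make_scripted_llm(responses: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays canned responses in order."""
    it = iter(responses)
    return lambda prompt: next(it)


stats = CallStats()
llm = make_scripted_llm([
    '{"action": "click", "id": ',       # truncated, invalid JSON
    '{"action": "click", "id": "B7"}',  # valid on the second attempt
])
action = call_with_parse_retries(llm, "observe page ...", stats)
print(action, stats)
```

The caller only ever sees the final valid action, which is why the retry never appears in task-level success metrics unless counters like `stats` are exported.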
Definition
Agent Execution Tax = (total_inference_calls − productive_calls) / productive_calls
Productive calls are those that returned valid structured output on the first attempt. The tax measures how much additional inference you pay, relative to the useful work done. Every percentage point is money spent on inference that delivers nothing.
Note the denominator: this is not the same as the raw retry rate (retries / total calls). An 18.6% retry rate translates to a 22.9% execution tax because the denominator shrinks when you remove the wasted calls.
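The relationship between the two numbers follows from the definition: with retry rate r = retries / total calls, the tax is retries / (total − retries) = r / (1 − r). A small sketch, using Gemini 2.5 Flash's counts as reported in this benchmark (886 calls, 165 parse retries); the function names are illustrative.

```python
def execution_tax(total_calls: int, productive_calls: int) -> float:
    """Wasted calls relative to productive calls."""
    if productive_calls <= 0:
        raise ValueError("productive_calls must be positive")
    return (total_calls - productive_calls) / productive_calls


def tax_from_retry_rate(retry_rate: float) -> float:
    """Convert a retry rate (retries / total calls) into an execution tax."""
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    return retry_rate / (1 - retry_rate)


# Gemini 2.5 Flash counts reported in this benchmark.
total, retries = 886, 165
retry_rate = retries / total
tax = execution_tax(total, total - retries)

print(f"retry rate:    {retry_rate:.1%}")  # 18.6%
print(f"execution tax: {tax:.1%}")         # 22.9%
```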
Applied to Our Data
| Model | Productive Calls | Total Calls | Execution Tax |
| --- | --- | --- | --- |
| Kimi K2.5 | 852 | 852 | 0.0% |
| GLM-5 | 879 | 884 | 0.6% |
| MiniMax M2.5 | 815 | 828 | 1.6% |
| Gemini 2.5 Flash | 721 | 886 | 22.9% |
Measured on instrumented runs (90 tasks per model). Zero parse failures (exhausted retries) recorded across all models.
For every dollar of productive inference Gemini produces, you pay an additional 23 cents in waste. Kimi's tax is zero.
The Execution Tax bar chart at the top of the article visualizes these figures; the table above carries the exact numbers for citation.
How the Tax Compounds
The tax is not a single cost. It stacks across three dimensions:
Token tax. Wasted tokens on malformed responses, plus the full input context re-sent on every retry. Gemini averaged 15,482 input tokens per step; each retry re-sends that entire context for zero productive output.
Latency tax. Each retry adds a full LLM round-trip (~2.5s at Gemini's p50), roughly 12 seconds of dead time per task.
Cascade tax. A retry at step 8 can desync the agent's internal state, causing downstream steps to misinterpret the page and fail. Hardest to measure; most dangerous at scale.
Generalised Formulas
Expected retries per task = n_steps × retry_rate / (1 − retry_rate)
Token overhead per task = expected_retries × (avg_input_tokens + avg_output_tokens)
Latency overhead per task = expected_retries × avg_call_latency
For a 10-step task with Gemini's 18.6% retry rate: ~2.3 expected retries, ~36,500 wasted tokens, and ~5.7 seconds of dead time per task.
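The three formulas can be sketched as one function. The inputs below are the figures quoted in the text (10 steps, 18.6% retry rate, 15,482 input tokens per step, ~2.5s per call). The text does not report average output tokens per step, so `avg_output_tokens=500` is a placeholder assumption, and the function name is illustrative.

```python
def retry_overhead(
    n_steps: int,
    retry_rate: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    avg_call_latency_s: float,
) -> dict:
    """Expected per-task overhead caused by parse retries."""
    expected_retries = n_steps * retry_rate / (1 - retry_rate)
    return {
        "expected_retries": expected_retries,
        "token_overhead": expected_retries * (avg_input_tokens + avg_output_tokens),
        "latency_overhead_s": expected_retries * avg_call_latency_s,
    }


overhead = retry_overhead(
    n_steps=10,
    retry_rate=0.186,
    avg_input_tokens=15_482,
    avg_output_tokens=500,   # assumption: not reported in the text
    avg_call_latency_s=2.5,
)

for name, value in overhead.items():
    print(f"{name}: {value:,.1f}")
```

With these inputs the result lands close to the ~2.3 retries and ~5.7 seconds quoted above; the token figure shifts with whatever output length is assumed.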
Structured Output Reliability: The Root Cause
Execution tax is the lens. Structured output reliability is what drives it and is one of the most underreported bottlenecks in production agents.
The Data
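| Model | Total LLM Calls | Parse Retries | Retry Rate | Calls/Task |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | 886 | 165 | 18.6% | 14.7 |
| MiniMax M2.5 | 828 | 13 | 1.6% | 9.8 |
| GLM-5 | 884 | 5 | 0.6% | 10.3 |
| Kimi K2.5 | 852 | 0 | 0.0% | 10.2 |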
Gemini 2.5 Flash produced invalid structured output on nearly 1 in 5 LLM calls. The three Fireworks models combined: 18 retries across 2,564 calls (0.7%).
Why This Matters More Than You Think
In a 10-step agent task, the probability that at least one step requires a retry:
• Gemini (18.6% per call): 87.2%
• MiniMax (1.6% per call): 14.9%
• Kimi (0.0% per call): 0%
With Gemini, 87% of tasks experience at least one parse retry. This is not an edge case; it is the default experience. Gemini averaged 14.7 LLM calls per task versus ~10 for the Fireworks models: the extra ~4.7 calls are almost entirely retries and the downstream steps they force.
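These probabilities come from treating each step as an independent trial: the chance that all n steps parse cleanly is (1 − p)^n, so the chance of at least one retry is 1 − (1 − p)^n. The sketch below assumes that independence, which is a simplification (a retry at one step may make later steps more or less likely to fail), and the function name is illustrative.

```python
def p_at_least_one_retry(retry_rate: float, n_steps: int = 10) -> float:
    """Probability that a task of n_steps hits at least one parse retry,
    assuming every step fails independently with the same retry_rate."""
    return 1 - (1 - retry_rate) ** n_steps


per_call_retry_rates = {
    "Gemini 2.5 Flash": 0.186,
    "MiniMax M2.5": 0.016,
    "Kimi K2.5": 0.0,
}

for model, rate in per_call_retry_rates.items():
    print(f"{model}: {p_at_least_one_retry(rate):.1%}")
```

For Gemini's 18.6% per-call rate this gives about 87%, which is why a retry is the default experience rather than an edge case.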
Reliability-Adjusted Accuracy
Raw task accuracy tells you how often the agent succeeds. It does not account for the cost of getting there. A compound metric, Reliability-Adjusted Accuracy, discounts task success by execution overhead:
Reliability-Adjusted Accuracy = Task Success Rate × (1 − Execution Tax)
| Model | Task Accuracy | Execution Tax | Reliability-Adjusted Accuracy |
| --- | --- | --- | --- |
| GLM-5 | 57.1% | 0.6% | 56.8% |
| MiniMax M2.5 | 57.5% | 1.6% | 56.6% |
| Kimi K2.5 | 49.7% | 0.0% | 49.7% |
| Gemini 2.5 Flash | 45.0% | 22.9% | 34.7% |
The gap between Gemini's raw accuracy (45.0%) and its reliability-adjusted accuracy (34.7%) is the clearest illustration of the execution tax: nearly a quarter (22.9%) of Gemini's operational capacity is consumed by execution overhead. The Fireworks models barely move.
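The metric is a one-line calculation. The sketch below applies the formula to the task accuracy and execution tax figures from the table; the function name and dictionary layout are illustrative.

```python
def reliability_adjusted_accuracy(task_success_rate: float, execution_tax: float) -> float:
    """Discount task success by the share of inference lost to retries."""
    return task_success_rate * (1 - execution_tax)


# (task accuracy, execution tax) as reported in the table.
models = {
    "GLM-5": (0.571, 0.006),
    "MiniMax M2.5": (0.575, 0.016),
    "Kimi K2.5": (0.497, 0.000),
    "Gemini 2.5 Flash": (0.450, 0.229),
}

adjusted = {
    name: reliability_adjusted_accuracy(accuracy, tax)
    for name, (accuracy, tax) in models.items()
}

for name, value in adjusted.items():
    print(f"{name}: {value:.1%}")
```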
Why Nobody Measures This
The parse retry happens inside the LLM engine, before the agent framework ever sees the result. Unless you instrument the engine, retries are invisible. Static benchmarks (MMLU, HumanEval, ARC) measure model intelligence in isolation; they do not measure whether a model can sustain structured output compliance across a multi-step loop. Parse retry rate should be a first-class metric in every agent benchmark.
Prior work has documented adjacent findings. The original WebVoyager paper (He et al., 2024) introduced the benchmark we use here and established the framing that end-to-end web agent performance is a distinct measurement from static model evaluation. AgentBench (Liu et al., 2024) evaluated LLMs across eight agent environments and found large gaps between model-capability scores and task-completion rates in multi-step loops, reinforcing that agent-specific reliability metrics — not MMLU rank — should drive procurement decisions. SWE-bench (Jimenez et al., 2024) extended the same observation to software-engineering agents: models that top reasoning leaderboards resolve only a small fraction of real GitHub issues, because sustained structured execution across long tool-use loops is not what static evals measure.
What This Looks Like in Practice
Task: "Find all Uniqlo locations in Chicago, IL." (Google Maps, from the WebVoyager benchmark)
Both models received the same task, the same browser environment, and the same starting URL.
At a glance:
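| Metric | Kimi K2.5 | Gemini 2.5 Flash |
| --- | --- | --- |
| Steps taken | 12 | 16 |
| LLM calls | 12 | 25 |
| Parse retries | 0 | 9 |
| Total duration | 51.2s | 97.9s |
| Total LLM time | 23.2s | 57.5s |
| Input tokens | 87,063 | 207,971 |
| Output tokens | 3,236 | 8,411 |
| Result | Success | Success |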
Both models found the answer. One did it in 51 seconds with 12 clean calls. The other took 98 seconds and made 25 calls to accomplish 16 steps. The difference was not reasoning ability; it was execution overhead.
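The same arithmetic can be applied to this single task. The sketch below uses only the figures from the at-a-glance comparison; the per-task execution tax and the ratios it prints are an illustrative extension of the article's definition, not numbers reported in the benchmark, and the `Run` class and `summarize` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    steps: int
    llm_calls: int
    parse_retries: int
    duration_s: float
    llm_time_s: float
    input_tokens: int


kimi = Run(steps=12, llm_calls=12, parse_retries=0,
           duration_s=51.2, llm_time_s=23.2, input_tokens=87_063)
gemini = Run(steps=16, llm_calls=25, parse_retries=9,
             duration_s=97.9, llm_time_s=57.5, input_tokens=207_971)


def summarize(run: Run) -> dict:
    """Per-task overhead figures derived from raw counters."""
    productive_calls = run.llm_calls - run.parse_retries
    return {
        "calls_per_step": run.llm_calls / run.steps,
        "task_execution_tax": run.parse_retries / productive_calls,
        "llm_share_of_duration": run.llm_time_s / run.duration_s,
    }


for name, run in (("Kimi K2.5", kimi), ("Gemini 2.5 Flash", gemini)):
    print(name, {key: round(value, 3) for key, value in summarize(run).items()})

print("duration ratio:", round(gemini.duration_s / kimi.duration_s, 2))
print("input token ratio:", round(gemini.input_tokens / kimi.input_tokens, 2))
```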
Kimi K2.5: 12 steps, 0 retries
| Step | Action | LLM time |
| --- | --- | --- |
| 1 | Navigate to google.com/maps | 1.57s |
| 2 | Click "Accept cookies" | 1.75s |
| 3 | Click search input field | 1.30s |
| 4 | Type "Uniqlo Chicago IL" | 1.47s |
| 5 | Press Enter | 2.06s |
| 6 | Click back (close single-location panel) | 2.20s |
| 7 | Type "Uniqlo stores Chicago" | 1.49s |
| 8 | Click search | 2.09s |
| 9 | Click "Nearby" button | 2.02s |
| 10 | Type "Uniqlo" in nearby search | 2.60s |
| 11 | Click search | 3.33s |
| 12 | Submit answer (3 locations found) | 1.36s |
Every call produced valid JSON on the first attempt.
Gemini 2.5 Flash: 16 steps, 9 retries (25 total LLM calls)
| Step | Action |
| --- | --- |
| 1 | Navigate to google.com/maps/ |
| 2 | Click "Accept cookie
[truncated for AI cost control]