Agent Execution Tax

A benchmark of 720 browser agent tasks reveals that structured output reliability, not raw intelligence, is the bottleneck in agentic AI. Gemini 2.5 Flash incurred a 22.9% execution tax due to malformed JSON, while Kimi K2.5 had zero. This tax compounds into higher latency, cost, and failure rates. The report introduces Reliability-Adjusted Accuracy and cost-per-successful-task metrics.

Source: Fireworks AI Blog

Agents Don't Fail on Intelligence. They Fail on Execution.

PUBLISHED 5/20/2026

Table of Contents

What 720 browser agent runs revealed about the real bottleneck in agentic AI.

Deployment Readiness Scorecard

The Agent Execution Tax

Definition

Applied to Our Data

How the Tax Compounds

Generalised Formulas

Structured Output Reliability: The Root Cause

The Data

Why This Matters More Than You Think

Reliability-Adjusted Accuracy

Why Nobody Measures This

What This Looks Like in Practice

How We Measured This

The Setup

The Models

Scope of This Benchmark

Cost Per Successful Task > Token Pricing

Inference Latency: The Compounding Story

What the Fireworks Serving Layer Contributes

Per-Site Analysis: Where the Thesis Holds or Breaks

Universal Success

Universal Failure

The Differentiators

Gemini's One Win: Google Flights

What This Means for AI Procurement

Procurement Scorecard

Model Profiles: Three Models, Three Strategies

GLM-5: The Reasoning Powerhouse

MiniMax M2.5: The Best Value

Kimi K2.5: Fastest Inference in This Benchmark

The Vision Question

Kimi K2.5 Vision: Infrastructure Constraint, Not Model Limitation

Closing

Appendix

A. Reproducibility

B. Benchmark Configuration

C. Evaluator Methodology

D. Full Per-Site Breakdown (Numeric Reference)

E. Data Files

What 720 browser agent runs revealed about the real bottleneck in agentic AI.

A Notte × Fireworks AI benchmark report.

Foundation models keep getting smarter. They ace reasoning benchmarks, write fluent code, and pass professional exams. Yet when you put them inside an agent loop, where they must observe a webpage, decide what to do, and output a structured action ten times in a row, they fail roughly half the time.

We ran 720 browser automation tasks across four LLMs to find out why. The answer was not intelligence. It was execution: one model wasted nearly 1 in 5 LLM calls on malformed JSON that had to be retried. That single reliability gap cascaded into higher latency, inflated cost, and lower task success, even though the model's raw reasoning capability was competitive.

We call this overhead the Agent Execution Tax: the ratio of wasted inference to productive inference. For the worst-performing model in our benchmark, that tax was 22.9%. For the best, it was zero.

In agent systems, reliability compounds harder than intelligence. The models that won were not the ones with the best reasoning scores. They were the ones that reliably did what they were told, every time, in the format they were asked for.

In production, that reliability is shaped not just by the model itself, but by the inference infrastructure serving it: structured output consistency, latency predictability, and stable execution under repeated agent loops.

At 10,000 agent tasks per day, a modest production volume, the execution overhead of the worst-performing model costs over $40,000 per year in inference that produces no value. A model that looks cheaper per token can cost significantly more per outcome once retries, failures, and inflated call counts are factored in.

Scope. This is a text-only browser agent benchmark. Results measure structured output reliability and step efficiency in a multi-step agent loop — not general model intelligence, reasoning ability, or multimodal capability. See Scope of This Benchmark below for the full scope statement.

Deployment Readiness Scorecard

If you are evaluating models for an agent deployment, here is how they map to production constraints.

| If you need... | Use | Why |
| --- | --- | --- |
| Maximum task accuracy | GLM-5 | 57.1% accuracy; 100% on Google Maps, HuggingFace, BBC News, Wolfram Alpha; strongest on structured data extraction and multi-step reasoning |
| Lowest cost at scale | MiniMax M2.5 | $0.062 per successful task (2.3x cheaper than Gemini); RL-trained agent that takes the fewest steps (9.8 avg) and rarely retries (1.6%) |
| Fastest real-time response | Kimi K2.5 | 2.1s p50 LLM latency; zero parse retries across 852 calls; best for user-facing agents where perceived speed matters |
| Rigorous procurement evaluation | Reliability-Adjusted Accuracy | Token pricing misleads at the model selection stage; cost per successful task and execution tax are the metrics that reflect what you actually pay for |

One-line summary per model:

• GLM-5: Best accuracy, highest cost. Use for compliance workflows, research automation, and tasks where errors carry downstream consequences.

• MiniMax M2.5: Best value. Default choice for scaled production workloads. The $40k/year waste calculation makes it the economically dominant option at volume.

• Kimi K2.5: Best speed, zero execution overhead. Use for customer-facing agents, live demos, and any workflow where response latency affects user trust.

The Agent Execution Tax

A browser agent task looks simple from the outside: go to Amazon, search for a product, extract the price. Under the hood, it is a multi-step loop:

observe page → LLM generates action (as JSON) → execute action → observe new page → repeat

A typical task takes 10 steps. Each step is an LLM call that must return valid structured output: a JSON object specifying which element to click, what text to type, or what data to extract. If the JSON is malformed, the framework retries. And that retry is invisible: it does not show up in task success rates or reasoning benchmarks. It only surfaces as inflated call counts, latency, and cost once you instrument the engine itself.
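The hidden retry described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: `CallStats`, `call_with_parse_retry`, `make_scripted_llm`, and the `max_retries` default are hypothetical names introduced here, and the "LLM" is a scripted stub.

```python
import json
from dataclasses import dataclass


@dataclass
class CallStats:
    total_calls: int = 0      # every LLM call, including retries
    malformed_calls: int = 0  # calls whose output was not a JSON object


def call_with_parse_retry(llm, prompt, stats, max_retries=3):
    """Call `llm` until it returns a JSON object, counting every attempt."""
    for _ in range(max_retries + 1):
        stats.total_calls += 1
        raw = llm(prompt)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            action = None
        if isinstance(action, dict):
            return action
        stats.malformed_calls += 1  # invisible unless the engine records it
    raise ValueError("parse failure: retries exhausted")


def make_scripted_llm(outputs):
    """Hypothetical stand-in for a model: replays canned responses in order."""
    it = iter(outputs)
    return lambda prompt: next(it)


stats = CallStats()
llm = make_scripted_llm([
    '{"action": "click", "element": 12',   # truncated JSON, forces a retry
    '{"action": "click", "element": 12}',  # valid on the second attempt
])
action = call_with_parse_retry(llm, "observe page ...", stats)
print(action, stats)
```

The caller only ever sees the valid action; the wasted first call shows up solely in `stats`, which is why the counters have to live inside the engine.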

Definition

Agent Execution Tax = (total_inference_calls − productive_calls) / productive_calls

Productive calls are those that returned valid structured output on the first attempt. The tax measures how much additional inference you pay, relative to the useful work done. Every percentage point is money spent on inference that delivers nothing.

Note the denominator: this is not the same as the raw retry rate (retries / total calls). An 18.6% retry rate translates to a 22.9% execution tax because the denominator shrinks when you remove the wasted calls.
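The relationship between the two rates follows from the definition: with T total calls and R retries, the tax is R / (T − R), which equals r / (1 − r) for retry rate r = R / T. The sketch below checks this against the Gemini 2.5 Flash call counts reported in this benchmark (886 total calls, 165 retries, 721 productive). It assumes every non-productive call is a parse retry; the function names are illustrative.

```python
def retry_rate(total_calls, retries):
    """Share of all calls that were retries."""
    return retries / total_calls


def execution_tax(total_calls, productive_calls):
    """Wasted calls relative to productive calls."""
    return (total_calls - productive_calls) / productive_calls


def tax_from_retry_rate(r):
    """Same quantity expressed through the retry rate: r / (1 - r)."""
    return r / (1 - r)


r = retry_rate(886, 165)
tax = execution_tax(886, 721)
print(f"retry rate {r:.1%}, execution tax {tax:.1%}")
```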

Applied to Our Data

| Model | Productive Calls | Total Calls | Execution Tax |
| --- | --- | --- | --- |
| Kimi K2.5 | 852 | 852 | 0.0% |
| GLM-5 | 879 | 884 | 0.6% |
| MiniMax M2.5 | 815 | 828 | 1.6% |
| Gemini 2.5 Flash | 721 | 886 | 22.9% |

Measured on instrumented runs (90 tasks per model). Zero parse failures (exhausted retries) recorded across all models.

For every dollar of productive inference Gemini produces, you pay an additional 23 cents in waste. Kimi's tax is zero.

How the Tax Compounds

The tax is not a single cost. It stacks across three dimensions:

Token tax. Wasted tokens on malformed responses, plus the full input context re-sent on every retry. Gemini averaged 15,482 input tokens per step; each retry re-sends that entire context for zero productive output.

Latency tax. Each retry adds a full LLM round-trip (~2.5s at Gemini's p50). Across the ~4.7 extra calls Gemini averaged per task, that is roughly 12 seconds of dead time per task.

Cascade tax. A retry at step 8 can desync the agent's internal state, causing downstream steps to misinterpret the page and fail. Hardest to measure; most dangerous at scale.

Generalised Formulas

Expected retries per task = n_steps × retry_rate / (1 − retry_rate)

Token overhead per task = expected_retries × (avg_input_tokens + avg_output_tokens)

Latency overhead per task = expected_retries × avg_call_latency

For a 10-step task with Gemini's 18.6% retry rate: ~2.3 expected retries, ~36,500 wasted tokens, and ~5.7 seconds of dead time per task.
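The formulas above can be evaluated directly. The sketch below is illustrative: `retry_overhead` is a hypothetical helper, the 2.5s latency is the approximate Gemini p50 quoted earlier, and the average output tokens per step is not stated in this section, so the example passes 0 and the token figure is a lower bound on the ~36,500 quoted above.

```python
def retry_overhead(n_steps, retry_rate, avg_input_tokens,
                   avg_output_tokens, avg_call_latency_s):
    """Expected per-task waste from parse retries, per the formulas above."""
    expected_retries = n_steps * retry_rate / (1 - retry_rate)
    return {
        "expected_retries": expected_retries,
        "token_overhead": expected_retries * (avg_input_tokens + avg_output_tokens),
        "latency_overhead_s": expected_retries * avg_call_latency_s,
    }


# Gemini 2.5 Flash figures from this report; output tokens set to 0 (assumption).
est = retry_overhead(
    n_steps=10,
    retry_rate=0.186,
    avg_input_tokens=15_482,
    avg_output_tokens=0,
    avg_call_latency_s=2.5,
)
for name, value in est.items():
    print(f"{name}: {value:,.1f}")
```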

Structured Output Reliability: The Root Cause

Execution tax is the lens. Structured output reliability is what drives it and is one of the most underreported bottlenecks in production agents.

The Data

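| Model | Total LLM Calls | Parse Retries | Retry Rate | Calls/Task |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | 886 | 165 | 18.6% | 14.7 |
| MiniMax M2.5 | 828 | 13 | 1.6% | 9.8 |
| GLM-5 | 884 | 5 | 0.6% | 10.3 |
| Kimi K2.5 | 852 | 0 | 0.0% | 10.2 |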

Gemini 2.5 Flash produced invalid structured output on nearly 1 in 5 LLM calls. The three Fireworks models combined: 18 retries across 2,564 calls (0.7%).

Why This Matters More Than You Think

In a 10-step agent task, the probability that at least one step requires a retry is 1 − (1 − retry_rate)^10, assuming each call fails independently:

• Gemini (18.6% per call): 87.2%

• MiniMax (1.6% per call): 14.9%

• Kimi (0.0% per call): 0%

With Gemini, 87% of tasks experience at least one parse retry. This is not an edge case; it is the default experience. Gemini averaged 14.7 LLM calls per task versus ~10 for the Fireworks models: the extra ~4.7 calls are almost entirely retries and the downstream steps they force.
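The per-task probabilities come from compounding the per-call retry rate over the steps of a task. A minimal sketch, assuming each call fails independently with the same probability (a simplification; real failures may be correlated within a task), with an illustrative function name:

```python
def p_at_least_one_retry(per_call_retry_rate, n_calls=10):
    """Probability that a task of n_calls sees one or more parse retries."""
    return 1 - (1 - per_call_retry_rate) ** n_calls


rates = {
    "Gemini 2.5 Flash": 0.186,
    "MiniMax M2.5": 0.016,
    "Kimi K2.5": 0.0,
}
for model, rate in rates.items():
    print(f"{model}: {p_at_least_one_retry(rate):.1%}")
```

Even a small per-call rate grows quickly with task length, which is why longer agent loops are disproportionately sensitive to structured output reliability.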

Reliability-Adjusted Accuracy

Raw task accuracy tells you how often the agent succeeds. It does not account for the cost of getting there. A compound metric, Reliability-Adjusted Accuracy, discounts task success by execution overhead:

Reliability-Adjusted Accuracy = Task Success Rate × (1 − Execution Tax)

| Model | Task Accuracy | Execution Tax | Reliability-Adjusted Accuracy |
| --- | --- | --- | --- |
| GLM-5 | 57.1% | 0.6% | 56.8% |
| MiniMax M2.5 | 57.5% | 1.6% | 56.6% |
| Kimi K2.5 | 49.7% | 0.0% | 49.7% |
| Gemini 2.5 Flash | 45.0% | 22.9% | 34.7% |

The gap between Gemini's raw accuracy (45.0%) and its reliability-adjusted accuracy (34.7%) is the clearest illustration of the execution tax: the 10.3-point drop discounts nearly a quarter (22.9%) of Gemini's raw success rate for execution overhead. The Fireworks models barely move.
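The metric is a one-line computation. The sketch below applies it to the accuracy and execution tax figures reported in this section and ranks the models by the result; the function and variable names are illustrative.

```python
def reliability_adjusted_accuracy(task_success_rate, execution_tax):
    """Task success discounted by execution overhead."""
    return task_success_rate * (1 - execution_tax)


# (task accuracy, execution tax) as reported in this benchmark
models = {
    "GLM-5": (0.571, 0.006),
    "MiniMax M2.5": (0.575, 0.016),
    "Kimi K2.5": (0.497, 0.000),
    "Gemini 2.5 Flash": (0.450, 0.229),
}

raa = {name: reliability_adjusted_accuracy(acc, tax)
       for name, (acc, tax) in models.items()}
ranking = sorted(raa, key=raa.get, reverse=True)

for name in ranking:
    print(f"{name}: {raa[name]:.1%}")
```

Note that the ordering of the top two models flips relative to raw accuracy: MiniMax M2.5 leads on task accuracy, while GLM-5 leads once the tax is applied.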

Why Nobody Measures This

The parse retry happens inside the LLM engine, before the agent framework ever sees the result. Unless you instrument the engine, retries are invisible. Static benchmarks (MMLU, HumanEval, ARC) measure model intelligence in isolation; they do not measure whether a model can sustain structured output compliance across a multi-step loop. Parse retry rate should be a first-class metric in every agent benchmark.

Prior work has documented adjacent findings. The original WebVoyager paper (He et al., 2024) introduced the benchmark we use here and established the framing that end-to-end web agent performance is a distinct measurement from static model evaluation. AgentBench (Liu et al., 2024) evaluated LLMs across eight agent environments and found large gaps between model-capability scores and task-completion rates in multi-step loops, reinforcing that agent-specific reliability metrics — not MMLU rank — should drive procurement decisions. SWE-bench (Jimenez et al., 2024) extended the same observation to software-engineering agents: models that top reasoning leaderboards resolve only a small fraction of real GitHub issues, because sustained structured execution across long tool-use loops is not what static evals measure.

What This Looks Like in Practice

Task: "Find all Uniqlo locations in Chicago, IL." (Google Maps, from the WebVoyager benchmark)

Both models received the same task, the same browser environment, and the same starting URL.

At a glance:

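| | Kimi K2.5 | Gemini 2.5 Flash |
| --- | --- | --- |
| Steps taken | 12 | 16 |
| LLM calls | 12 | 25 |
| Parse retries | 0 | 9 |
| Total duration | 51.2s | 97.9s |
| Total LLM time | 23.2s | 57.5s |
| Input tokens | 87,063 | 207,971 |
| Output tokens | 3,236 | 8,411 |
| Result | Success | Success |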

Both models found the answer. One did it in 51 seconds with 12 clean calls. The other took 98 seconds and made 25 calls to accomplish 16 steps. The difference was not reasoning ability; it was execution overhead.
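The same per-task numbers can be turned into derived metrics for a single run. This is a small sketch using only the at-a-glance figures for this task; the `TaskRun` class and its property names are hypothetical, and it assumes every call beyond the step count was a parse retry (25 − 9 = 16 for Gemini).

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    model: str
    steps: int
    llm_calls: int
    parse_retries: int
    llm_time_s: float
    input_tokens: int

    @property
    def execution_tax(self):
        """Retries relative to productive calls, for this one task."""
        return self.parse_retries / (self.llm_calls - self.parse_retries)

    @property
    def avg_call_latency_s(self):
        return self.llm_time_s / self.llm_calls

    @property
    def input_tokens_per_step(self):
        return self.input_tokens / self.steps


kimi = TaskRun("Kimi K2.5", 12, 12, 0, 23.2, 87_063)
gemini = TaskRun("Gemini 2.5 Flash", 16, 25, 9, 57.5, 207_971)

for run in (kimi, gemini):
    print(f"{run.model}: tax {run.execution_tax:.1%}, "
          f"{run.avg_call_latency_s:.2f}s per call, "
          f"{run.input_tokens_per_step:,.0f} input tokens per step")
```

On this task Gemini's per-task tax is well above its benchmark-wide figure, which illustrates how unevenly the overhead can land on individual runs.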

Kimi K2.5: 12 steps, 0 retries

| Step | Action | LLM time |
| --- | --- | --- |
| 1 | Navigate to google.com/maps | 1.57s |
| 2 | Click "Accept cookies" | 1.75s |
| 3 | Click search input field | 1.30s |
| 4 | Type "Uniqlo Chicago IL" | 1.47s |
| 5 | Press Enter | 2.06s |
| 6 | Click back (close single-location panel) | 2.20s |
| 7 | Type "Uniqlo stores Chicago" | 1.49s |
| 8 | Click search | 2.09s |
| 9 | Click "Nearby" button | 2.02s |
| 10 | Type "Uniqlo" in nearby search | 2.60s |
| 11 | Click search | 3.33s |
| 12 | Submit answer (3 locations found) | 1.36s |

Every call produced valid JSON on the first attempt.

Gemini 2.5 Flash: 16 steps, 9 retries (25 total LLM calls)

StepAction

1Navigate to google.com/maps/

2Click "Accept cookie
