2026-06-24 18:06 UTC · In-site rewrite · 6 min read · Updated: 2026-06-24 18:11 UTC

AI coding agents need evidence-first review, not just cheaper routing

This article argues that for AI-assisted coding, model call costs are only a small fraction of total engineering decision cost, with human review and rework being the true bottleneck. It compares routing, agentic RAG, multi-model deliberation, and automated testing, and advocates for a verification layer that connects claims to evidence, narrowing the review search space. It also quantifies when extra verification pays off.

Source: Hacker News AI · Author: CalmAngler

In many AI-assisted workflows, code generation is no longer the only bottleneck. Assistants read repositories, edit files, run commands, and write tests. Agentic systems plan, call tools, retrieve more context, and assemble an answer over several steps or several models.

For the reviewer, that raises a harder question: what was actually checked, what did the model merely assume, and how much of this result can I rely on before merge?

Producing plausible code has become cheaper. Checking its foundations has not necessarily become cheaper with it. Comparing AI tools only by token price, generation speed, or agent count misses the engineering decision that matters: the path from a request to a justified merge decision.

This article asks three questions:

Does AI reduce total decision cost once calls, review, rework, and escaped-error risk are counted?

Which part of that cost is targeted by routing, retrieval, multi-model deliberation, and automated checks?

What should a verification layer produce, and how can its value be falsified rather than merely claimed?

The verification tax

The productivity evidence is mixed. METR ran a randomized controlled trial with 16 experienced open-source developers performing 246 real tasks in mature repositories they knew well, using early-2025 tooling. With AI, tasks took 19% longer on average [1].

In February 2026, METR reported that newer data probably shows a larger uplift, but explicitly called the signal unreliable. The raw estimate for returning developers was -18% change in completion time with a confidence interval of [-38%, +9%]; for newly recruited developers it was -4% with [-15%, +9%], where negative means speedup. Both intervals include zero effect [2].

The honest conclusion is neither “AI always speeds developers up” nor “AI always slows them down.” Productivity depends on tool maturity, repository familiarity, task shape, context acquisition, and the cost of checking the result.

The 2025 DORA report provides a different, observational view of nearly 5,000 technology professionals: 90% use AI at work, more than 80% perceive a productivity gain, but 30% have little or no trust in AI-generated code. AI adoption is positively associated with delivery throughput and product performance and negatively associated with delivery stability [9]. This is not a causal estimate. It is consistent with a systems hypothesis: faster local generation may increase downstream load if testing and delivery controls do not scale with change volume.

A synthesis of seven Google studies found that 39% of external developers trust GenAI output quality only slightly or not at all. Perceived rigor of review and testing, and developer control over where AI is used, were positively associated with trust [7].

Review itself is not only defect-finding. In Bacchelli and Bird’s study of 200 Microsoft review threads and 570 comments, code improvements accounted for 29% of comments and defects for 14%. The authors identify understanding the context and the change as central to review and record knowledge transfer as an outcome in its own right [3].

An illustrative review-load model

Assume a team handles 20 PRs per week and an average review takes 30 minutes:

20 PR × 0.5 h = 10 reviewer-hours / week

If AI doubles throughput while review cost per PR stays fixed:

40 PR × 0.5 h = 20 reviewer-hours / week

If AI-assisted PRs become wider and review time rises by 25%:

40 PR × 0.625 h = 25 reviewer-hours / week

Scenario | PRs/week | Review time per PR | Weekly review load
Pre-AI | 20 | 30 min | 10 h
2× throughput | 40 | 30 min | 20 h
2× throughput + wider PRs | 40 | 37.5 min | 25 h

This is a sensitivity model, not a market statistic. It shows the mechanism: faster generation may move work from writing to checking rather than remove it.
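The arithmetic behind the three scenarios can be reproduced in a few lines. This is a minimal sketch of the sensitivity model, not a measurement. The function name `weekly_review_hours` and the scenario labels are chosen here for illustration, and the inputs are the assumed figures from the text.

```python
def weekly_review_hours(prs_per_week: int, review_minutes: float) -> float:
    """Reviewer-hours per week for a given PR volume and per-PR review time."""
    return prs_per_week * review_minutes / 60.0


# Illustrative inputs taken from the scenarios in the text: (PRs per week, minutes per review).
scenarios = {
    "Pre-AI": (20, 30.0),
    "2x throughput": (40, 30.0),
    "2x throughput + wider PRs": (40, 30.0 * 1.25),  # review time rises by 25%
}

loads = {name: weekly_review_hours(n, m) for name, (n, m) in scenarios.items()}

for name, hours in loads.items():
    print(f"{name}: {hours:.1f} reviewer-hours/week")
```

Changing the throughput multiplier or the review-time increase shows how quickly review load grows when generation gets faster but checking does not.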

The total cost of an engineering decision

The token bill is not the total cost. Define the expected cost of one decision:

C_total = C_model + C_tools + R_hour × (T_review + T_rework) + P_escape × L_escape

C_model: model calls;

C_tools: CI, sandbox, retrieval, and other compute;

R_hour: internal cost of one engineering hour;

T_review: time to an apply/review/reject decision;

T_rework: expected time to fix issues found before merge;

P_escape: probability that a material error passes review;

L_escape: expected loss from such an escape.

Take an illustrative baseline: C_model = $5, review takes 60 minutes, and R_hour = $80. Set tools, rework, and risk aside temporarily:

C_total = $5 + $80 = $85
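The cost definition above translates directly into a small function. The sketch below is one way to encode it. The parameter names mirror the symbols in the formula, the function name is an assumption made for illustration, and the baseline call sets tools, rework, and risk to zero exactly as the text does.

```python
def total_decision_cost(
    c_model: float,     # C_model: model calls, in dollars
    c_tools: float,     # C_tools: CI, sandbox, retrieval, other compute
    r_hour: float,      # R_hour: internal cost of one engineering hour
    t_review_h: float,  # T_review, in hours
    t_rework_h: float,  # T_rework, in hours
    p_escape: float,    # P_escape: probability a material error passes review
    l_escape: float,    # L_escape: expected loss from such an escape
) -> float:
    """Expected cost of one engineering decision, term by term as defined in the text."""
    return c_model + c_tools + r_hour * (t_review_h + t_rework_h) + p_escape * l_escape


# Illustrative baseline: $5 of model calls, a 60-minute review at $80/h,
# with tools, rework, and risk temporarily set to zero.
baseline = total_decision_cost(
    c_model=5.0, c_tools=0.0, r_hour=80.0,
    t_review_h=1.0, t_rework_h=0.0, p_escape=0.0, l_escape=0.0,
)
print(f"C_total = ${baseline:.2f}")
```

Filling in non-zero values for rework and escape risk is where the model becomes useful, since those are the terms a cheaper model call can quietly raise.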

The ceiling on pure model-bill optimization

If model calls are a fraction f = C_model / C_total, then optimizing only the model bill while holding workload, quality, review, rework, and risk fixed lowers C_total by at most f. At the reference numbers:

f = 5 / 85 = 5.9%

This is not a ceiling on routing’s total effect. A weaker cheap model may raise retries, T_rework, and P_escape; a good router may cut latency and failed calls. It is an accounting observation: when the model bill is a small part of the total, optimizing that line alone cannot solve a review-bound bottleneck.

Cutting review from 60 to 40 minutes produces a different scale of change:

C_total = $5 + $80 × (40/60) = $58.33

Saving = ($85 - $58.33) / $85 = 31.4%

Change | Model cost | Review cost | C_total | Saving
Baseline | $5.00 | $80.00 | $85.00 | n/a
Model calls halved | $2.50 | $80.00 | $82.50 | 2.9%
Review 60→40 min | $5.00 | $53.33 | $58.33 | 31.4%
Both | $2.50 | $53.33 | $55.83 | 34.3%
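The comparison above can be recomputed with a short script, which also makes it easy to substitute a team's own numbers. This is a sketch under the same simplification as the text (tools, rework, and risk set aside). The helper name `simple_cost` and the variable names are assumptions introduced for illustration.

```python
R_HOUR = 80.0  # illustrative internal cost of one engineering hour, from the text


def simple_cost(c_model: float, review_minutes: float, r_hour: float = R_HOUR) -> float:
    """Simplified C_total: model bill plus review time, ignoring tools, rework, and risk."""
    return c_model + r_hour * review_minutes / 60.0


baseline = simple_cost(5.0, 60)
rows = {
    "Baseline": baseline,
    "Model calls halved": simple_cost(2.5, 60),
    "Review 60->40 min": simple_cost(5.0, 40),
    "Both": simple_cost(2.5, 40),
}

# Share of the model bill in the baseline total: the ceiling on savings
# from optimizing only that line while everything else is held fixed.
model_share = 5.0 / baseline
print(f"f = {model_share:.1%}")

for name, cost in rows.items():
    saving = (baseline - cost) / baseline
    print(f"{name}: C_total = ${cost:.2f}, saving = {saving:.1%}")
```

Even removing the model bill entirely could not save more than `model_share` of the baseline, which is the accounting ceiling described earlier.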

In autonomous agentic loops with little human oversight, f may be large and routing can become the main economic lever. In workflows constrained by costly human review, f is lower. The relevant question is which term actually dominates the total cost.

Different systems control different parts of the cost

Modern AI systems often look similar: agents, orchestration, retrieval, a judge, and synthesis. Similar shape does not imply the same job.

Routing: Kilo Gateway and RouteLLM

Kilo exposes an OpenAI-compatible endpoint, access to many models, BYOK, usage tracking, spend limits, and organization controls [11]. ByteByteGo describes routing on a known mode — planning, coding, debugging — with user-selected tiers and a server-updated model map. The reported Kilo figures — roughly one-third lower average request cost, 80–90% of requests not requiring frontier models, a greater-than-10× tier gap, and an estimated $87K quarterly overspend from misrouting routine traffic — are vendor-reported and not independently verified [8].

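An idealized model shows the potential scale. Assume 15% of requests go to a frontier tier at relative cost 1 and the remaining 85% go to a tier that costs one-tenth as much:

relative_cost = 0.15 × 1 + 0.85 × 0.10 = 0.235

relative reduction = 1 - 0.235 = 76.5%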

RouteLLM provides primary research evidence for the trade-off: a 3.66× cost-saving ratio at 95% of GPT-4’s MT-Bench score for a GPT-4/Mixtral-8×7B pair, equivalent to 72.7% relative cost reduction [12]. Its cost model uses short single-turn prompts and benchmark score as quality. It is not a coding-agent loop or evidence that a repository change is safe.
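Both routing figures follow from simple arithmetic, sketched below. The function names are hypothetical. The blended-cost function assumes a two-tier split with the frontier tier normalized to cost 1. The conversion from a cost-saving ratio to a relative reduction assumes that a ratio of k means the routed cost is 1/k of the baseline, which is the reading that reproduces the 72.7% figure quoted above.

```python
def blended_relative_cost(frontier_share: float, cheap_relative_cost: float) -> float:
    """Average cost per request relative to sending everything to the frontier tier."""
    return frontier_share * 1.0 + (1.0 - frontier_share) * cheap_relative_cost


def reduction_from_ratio(cost_saving_ratio: float) -> float:
    """Relative cost reduction implied by a k-times cost-saving ratio (cost becomes 1/k)."""
    return 1.0 - 1.0 / cost_saving_ratio


# Idealized split: 15% of requests on the frontier tier, 85% on a tier at one-tenth the cost.
idealized = blended_relative_cost(frontier_share=0.15, cheap_relative_cost=0.10)
print(f"relative cost = {idealized:.3f}, reduction = {1.0 - idealized:.1%}")

# RouteLLM's reported 3.66x cost-saving ratio expressed as a relative reduction.
routellm_reduction = reduction_from_ratio(3.66)
print(f"RouteLLM reduction = {routellm_reduction:.1%}")
```

Neither number says anything about rework or escaped errors. They describe the model bill only.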

Agentic RAG: sufficient context

Google describes a multi-agent RAG with a dedicated Sufficient Context Agent. It compares the query, retrieved snippets, and a draft, names missing information, and can trigger another retrieval pass. Google reports up to 34% higher accuracy than standard RAG on factuality datasets [4].

The Sufficient Context research exposes a broader failure mode: models often answer incorrectly rather than abstain when context is insufficient. Guided abstention improved correctness among answered cases by 2–10% for Gemini, GPT, and Gemma [5].

This supports a sufficient-context loop, but it is not a measured reduction in T_rework or P_escape for software development. A codebase is not merely a document corpus; it contains runtime behavior, callers, invariants, and migrations.

Multi-model deliberation: consensus is not proof

OpenRouter Fusion runs a parallel panel of 1–8 models. A judge returns a structured comparison of consensus, contradictions, partial coverage, unique insights, and blind spots; a final model writes the answer. The documentation describes the pipeline but does not provide an independent effectiveness benchmark [10].

Google Research compared 180 agent configurations. Independent topology amplified errors by up to 17.2×, while centralized coordination held amplification to 4.4×. Multi-agent improved the parallelizable Finance-Agent result by 80.9%, but every multi-agent variant degraded the sequential PlanCraft result by 39–70%. The authors’ predictive model selected the optimal architecture for 87% of unseen configurations [6].

This evaluation did not contain repository code review. The narrower engineering hypothesis is that value depends on topology, task decomposability, a centralized gate, and evidence handoffs — not on agent count alone.

Tests and static analysis

SAST, DAST, CodeQL, Semgrep, unit tests, and mutation tests provide repeatable checks of explicitly encoded properties under controlled inputs, configuration, and environment. Their quality is bounded by coverage, false positives, false negatives, and flakiness.

They are necessary, but do not always reveal that a model never opened the relevant file, built a conclusion on a false assumption, or tested an implementation detail instead of a system invariant. Green checks are not proof of complete intent.

Side by side

Approach | Primary problem | Unit of decision | Main output | Does not solve by itself
Kilo / routing | Model access, cost, policy | Model request | Completion + cost data | Trust in an engineering change
Agentic RAG | Incomplete context | Context sufficiency | Grounded answer | Patch safety and codebase invariants
Fusion / multi-model | Fragility of one answer | Agreement/disagreement | Consensus + contradictions | Factual checking of repository claims
Tests / static | Formalizable properties | Test/rule result | Pass/fail + diagnostics | Intent, assumptions, completeness
Verification artifact | Hidden checking area | Merge decision | Evidence boundaries + verdict | A correctness guarantee

These systems are not necessarily direct competitors. Routing manages model-call cost. Agentic RAG tests context sufficiency. Multi-model deliberation surfaces disagreement. Tests check formalized properties. A verification artifact should connect those signals to a decision about how far a candidate is supported.

Trust debt and hidden checking work

Suppose an engineering answer contains a set of material claims:

C = {c1, c2, ..., cn}

For each claim, a reviewer needs to know whether it is supported by evidence, contradicted, or still an assumption. A rough diagnostic metric is:

evidence_coverage = supported_claims / total_material_claims

If an answer contains 20 material claims and sufficient evidence exists for 12:

evidence_coverage = 12 / 20 = 60%

The remaining 40% are not necessarily wrong. They are the area a reviewer still needs to inspect. If a tool does not expose that area, the engineer first has to discover it and only then verify it. That is hidden verification work.
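A claim ledger of this kind is easy to sketch. The example below is illustrative only. The `Status` values follow the three states named above (supported, contradicted, assumption), while the class and function names are assumptions, not part of any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    ASSUMPTION = "assumption"


@dataclass(frozen=True)
class Claim:
    text: str
    status: Status


def evidence_coverage(claims: list[Claim]) -> float:
    """Share of material claims that are backed by sufficient evidence."""
    if not claims:
        raise ValueError("coverage is undefined for an empty claim set")
    supported = sum(1 for c in claims if c.status is Status.SUPPORTED)
    return supported / len(claims)


def review_area(claims: list[Claim]) -> list[Claim]:
    """Claims a reviewer still has to inspect: everything not marked supported."""
    return [c for c in claims if c.status is not Status.SUPPORTED]


# The worked example from the text: 20 material claims, 12 with sufficient evidence.
claims = [Claim(f"claim {i}", Status.SUPPORTED) for i in range(12)]
claims += [Claim(f"claim {i}", Status.ASSUMPTION) for i in range(12, 20)]

coverage = evidence_coverage(claims)
remaining = review_area(claims)
print(f"evidence_coverage = {coverage:.0%}, still to inspect: {len(remaining)} claims")
```

The useful output is less the percentage than the explicit list returned by `review_area`, which is the search area that would otherwise have to be discovered by hand.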

The goal of a verification layer is not to declare an answer absolutely correct. It is to:

connect material claims to checkable evidence;

expose relevant targets that were and were not inspected;

separate assumptions from supported conclusions;

preserve critique and rejected hypotheses;

surface open production and PR risks;

narrow the manual search area without hiding uncertainty.

Review remains. The search area should become smaller.
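One possible shape for such an artifact is sketched below. It is a hypothetical data structure, not a description of any existing product. Every field name, the example claims, and the file paths are invented for illustration and map loosely onto the goals listed above.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationArtifact:
    """Hypothetical verification artifact; all field names are assumptions."""
    evidence: dict[str, list[str]] = field(default_factory=dict)  # claim -> evidence references
    assumptions: list[str] = field(default_factory=list)
    relevant_targets: set[str] = field(default_factory=set)       # files that matter to the change
    inspected_targets: set[str] = field(default_factory=set)      # files actually opened
    rejected_hypotheses: list[str] = field(default_factory=list)  # preserved critique
    open_risks: list[str] = field(default_factory=list)

    def uninspected_targets(self) -> list[str]:
        """Relevant targets that were never inspected."""
        return sorted(self.relevant_targets - self.inspected_targets)

    def manual_search_area(self) -> dict[str, list[str]]:
        """What a reviewer still has to look at, with uncertainty left visible."""
        return {
            "claims_without_evidence": [c for c, refs in self.evidence.items() if not refs],
            "assumptions": list(self.assumptions),
            "uninspected_targets": self.uninspected_targets(),
            "open_risks": list(self.open_risks),
        }


# Invented example data; none of these paths or claims come from a real repository.
artifact = VerificationArtifact(
    evidence={
        "retry limit is enforced": ["tests/test_retry.py::test_limit"],
        "no caller depends on the old signature": [],
    },
    assumptions=["config is loaded once at startup"],
    relevant_targets={"client.py", "retry.py", "migrations/0007.sql"},
    inspected_targets={"client.py", "retry.py"},
    rejected_hypotheses=["timeout caused by DNS lookup"],
    open_risks=["migration not exercised in CI"],
)
area = artifact.manual_search_area()
for section, items in area.items():
    print(section, items)
```

The design choice worth noting is that unsupported claims and uninspected targets are reported rather than dropped, so the artifact narrows the search without implying a correctness guarantee.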

When extra verification pays for itself

Ignoring risk for a moment, an extra check costing ΔC pays for itself when it saves at least T_break_even = ΔC / R_hour of review time. At R_hour = $80:

Extra cost/run | Required review saving
$2 | 1.5 min
$5 | 3.75 min
$10 | 7.5 min
$20 | 15 min
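The break-even values above come from a single division. A minimal sketch, assuming the same $80 hourly rate and using a hypothetical function name:

```python
def break_even_minutes(extra_cost: float, r_hour: float = 80.0) -> float:
    """Review minutes an extra check must save to cover its own cost, ignoring risk."""
    return extra_cost / r_hour * 60.0


# Extra cost per run in dollars, as in the table above.
break_even = {cost: break_even_minutes(cost) for cost in (2, 5, 10, 20)}

for cost, minutes in break_even.items():
    print(f"${cost} extra per run -> must save {minutes:g} min of review")
```

At a higher hourly rate the threshold drops proportionally, so the same check pays off sooner on a more expensive team.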

Reducing P_escape by 0.1 percentage point — from 1.0% to 0.9% — at L_escape = $10,000 yields:

(0.010 - 0.009) × $10,000 = $10 expected saving per run

L_escape | Saving/run | Saving/month at 100 runs
$1,000 | $1 | $100
$10,000 | $10 | $1,000
$100,000 | $100 | $10,000
$1,000,000 | $1,000 | $100,000
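The same expected-loss arithmetic can be scripted to test other assumptions. The sketch below uses the illustrative probabilities from the text (1.0% reduced to 0.9%) and 100 runs per month. The function name is an assumption.

```python
def expected_saving_per_run(p_before: float, p_after: float, l_escape: float) -> float:
    """Expected loss avoided per run when the escape probability drops."""
    return (p_before - p_after) * l_escape


RUNS_PER_MONTH = 100  # illustrative volume, as in the table above

savings = {
    loss: expected_saving_per_run(0.010, 0.009, loss)
    for loss in (1_000, 10_000, 100_000, 1_000_000)
}

for loss, per_run in savings.items():
    monthly = per_run * RUNS_PER_MONTH
    print(f"L_escape = ${loss:,}: ${per_run:,.0f} per run, ${monthly:,.0f} per month")
```

The result scales linearly with L_escape, so the estimate is only as reliable as the loss figure and the probabilities fed into it.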

This is an expected-loss model, not a measured product outcome and not literal insurance. Expensive v
