2026-06-26 15:50 UTCIn-site rewrite7 min readUpdated: 2026-06-26 16:14 UTC

Agentic Code Review

As AI coding agents become extremely proficient, the bottleneck has shifted from writing code to reviewing it. Data shows a dramatic increase in code churn, defects, and review time. The key is to adapt review processes based on the context: blast radius, code longevity, and team size. Capturing agent reasoning can alleviate review burden.

SourceO'Reilly AI & ML RadarAuthor: Addy Osmani

The following article originally appeared on Addy Osmani’s blog site and is being republished here with the author’s permission.

Coding agents are extraordinarily good now, and getting better fast. The interesting consequence is that the hard part of engineering moved from writing code to deciding whether to trust it, which makes review the most leveraged skill in software right now. How you approach it depends enormously on who you are: A solo developer with no users and a team maintaining a 10-year-old application are not solving the same problem.

I am more optimistic about agentic engineering than I have ever been. The agents are genuinely good, they get better every month, and on an ordinary day I now ship things I would not have attempted a year ago. This write-up is a map of where the interesting work went, because it did move, and most teams have not fully caught up to where.

Code review used to work because of a happy accident of relative speed. A senior engineer could read code faster than a junior could write it, so review kept pace without anyone designing it to, and the team absorbed how the system fit together as a side effect of reading each other’s diffs. A lot of that was not deliberate. It fell out of a single fact: Writing code was the slow, expensive part, and reading it was cheap and fast.

That fact no longer holds. An agent will produce a thousand lines of often solid, well-formatted code in less time than it takes me to read this paragraph, while a human’s reading speed has not changed since roughly the day we started staring at screens for a living. So the constraint moved downstream, to the one step that did not get faster: a person being confident the change is right. I don’t think that’s a loss. It’s the most leveraged place in software to be good right now, and it’s where I’ve put most of my attention this year.

There’s a happy twist here that shapes the rest of this piece. The same tools generating all that extra code are also the best thing I have for keeping up with it. On my own projects, including the popular open source ones, I now point Claude Code or Codex at a batch of incoming PRs and have them triage the queue for me, and that has genuinely changed how I spend my time. So this is not an anti-AI argument, and I will come back to exactly how I use AI.

It’s also not a data dump, and not another round of whether letting a model write your code is wonderful or the end of the craft, because that framing is useless. The only answer that survives contact with a real codebase is that it depends entirely on who you are. A developer vibe-coding a side project only a dozen people will ever run and a team keeping a 10-year-old enterprise system alive for another quarter share almost no constraints worth naming, and most of the advice in circulation is really one of those two people telling the other how to live.

What the 2026 data actually shows

The productivity gains from AI are real, but raw output overstates them: about four times the code for a tenth more delivered value. The gap between those numbers is review work, which is exactly why review is where the leverage now sits.

For a couple of years this was an anecdotal argument. It’s now measured at scale, by organizations with no shared agenda and in several cases competing commercial interests, and the measurements keep pointing the same way: AI pushes output sharply up and pushes both quality and reviewability down.

Faros AI instrumented 22,000 developers across 4,000 teams and tracked what happened as teams moved from low to high AI adoption. This is March 2026 data, about as current as anything here. The upside is real. Developers merge considerably more PRs and complete more work and throughput per engineer climbs. Then the rest of the report:

Code churn is up 861%.

The incidents-to-PR ratio is up 242.7%.

The per-developer defect rate is up from 9% to 54%.

Median review duration is up 441.5%, with time to first review and average review time both roughly doubling.

PRs merged with zero review are up 31.3%.

The last figure is the one I find hardest to dismiss, because nobody chose to stop reviewing. Reviewers simply couldn’t keep pace with the volume, so code began merging unread, and that became normal. The detail I keep returning to is that teams with mature, disciplined engineering practices were hit just as hard as everyone else. Good process didn’t protect them, because the volume arrived faster than any process was designed to absorb.

CodeRabbit studied 470 open source PRs in December 2025, 320 AI-coauthored and 150 human-only, and found the AI changes carried roughly 1.7x more issues. Logic and correctness problems were up about 75%, security issues were 1.5 to 2x more common, and readability problems more than tripled. The company’s AI director, David Loker, described these as “predictable, measurable weaknesses that organizations must actively mitigate.” Predictable is the operative word. These are known, locatable weaknesses, which is good news: It means a review process, human or automated, can be aimed straight at them.

One caveat to hold throughout: CodeRabbit and Faros both sell into this market, so their framing is not disinterested. That doesn’t make the numbers wrong—the effect sizes are large and consistent across unrelated sources—but vendor research deserves to be read with that in mind.

GitClear has the single number I would lead with. In its productivity data through 2025, daily AI users produce around 4x the raw output of nonusers, but measured against their own output a year earlier, the real productivity gain is only about 12%. You’re generating roughly four times the code for something like a tenth more delivered value, and a human still has to review all of it. To GitClear’s credit, CEO Bill Harding is explicit that some of even that 12% is selection bias, because stronger developers are concentrated in the AI cohort.

GitHub reports that Copilot review has now run over 60 million reviews, a 10x increase in under a year, and more than one in five reviews on the platform involves an agent. This is no longer a niche practice. It’s how code gets made.

Four datasets, four methods, one conclusion. We poured machine-speed output into a system built for human-speed work. The bottleneck didn’t disappear; it moved to verification, and review is where that bill comes due.

Everyone is solving a different problem

How much review a change needs depends almost entirely on its blast radius, and most advice you read was written by someone operating for a very different one.

Almost all the alarming data above comes from enterprise telemetry and from open source maintainers being overwhelmed. It’s entirely real if that is your situation. If you’re one person shipping something a handful of people will ever run, much of it simply doesn’t apply to you, and you shouldn’t be made to feel otherwise.

Three variables determine where you sit:

Blast radius: What happens when it breaks? Nothing, or angry users and money and PII on the line?

How long the code lives: A throwaway prototype you might rewrite next week, or a codebase you’ll maintain for years?

How many people need to understand it: Just you holding the whole thing in your head, or a team that has to share ownership over time?

Run the same diff through those three variables, and “good review” means genuinely different things.

If you’re working solo on a greenfield project with no users, review’s second job, distributing knowledge across a team, doesn’t exist for you. You are the team. The reasonable move is to lean hard on tests and automation, review the parts that genuinely matter, and accept a lighter touch on the rest. Duplication and churn cost far less when the code may not exist in a month and nobody is paged at 3:00am when it breaks. The catch, and people learn this one painfully, is that it only works if the tests are real. Skipping review without a safety net doesn’t remove the work. It defers it at a higher price, and standards slip when no one is there to push back. “No users” is permission to defer review. It isn’t permission to skip verification.

Then the project gets users. This is the dangerous middle, and the crossing is rarely noticed at the time. Review’s bug-catching role suddenly matters, because bugs now hurt people, and its knowledge-sharing role switches on, because it’s no longer only you. Teams keep their solo-era habits a few months too long, and then there’s a postmortem and the Faros numbers stop being a chart and become their own dashboard.

At the far end is the large organization with an old codebase and many users. Here every alarming figure lands at full strength. A duplicated helper isn’t a style nit; it’s a future bug surface and a maintenance cost that compounds for years. A change nobody understood is comprehension debt that becomes someone’s on-call incident. Review is doing several jobs at once, and the volume of agent output quietly breaks all of them. The Faros finding about mature teams is aimed squarely here.

So the point is not “Enterprises should be cautious and solo developers can relax.” It’s that the purpose of review changes with your position, so the rules have to change with it. Bolt an enterprise’s locked-down multi-agent evidence-required pipeline onto a two-person prototype and you’ve added friction for no benefit. Run “tests pass, ship it” on a payments system and you’ve built an incident generator with a green checkmark on top. Most bad advice in this space is one position on that spectrum prescribing to another.

What review is actually for now

Review was built to check an author’s reasoning. An agent does reason, but that reasoning is usually thrown away rather than attached to the code, so the reviewer has to reconstruct a rationale that never made it into the diff. The good news is that this is a tooling problem, and capturing the reasoning makes review dramatically easier.

This is the part that genuinely changed, and I think it is underappreciated.

When a human writes code, intent comes along for free. The reasoning, the alternatives weighed and discarded, lived in the author’s head, and review was you checking that reasoning. Modern agents do reason, often visibly, producing thinking traces and weighing options and explaining themselves as they go. The catch is that this reasoning is usually discarded the moment the diff is produced. It’s rarely captured and rarely attached to the PR, and in any case it is the agent’s reasoning about how to implement the task, not a human’s judgment about whether it was the right task to begin with. So review shifts from checking reasoning that sits in front of you to reconstructing intent that never got written down, which is harder and slower, and we keep acting surprised that it takes 441% longer.

A 2026 paper, “AI Slop and the Software Commons,” analyzed 1,154 posts across 15 Reddit and Hacker News threads where developers discussed “AI slop.” One line from a developer has stayed with me: reviewing an agent’s PR made them “the first human being to ever lay eyes on this code.”

That sentiment points straight at the fix. In normal review, the author already understood the change and you were checking their work. With an agent PR, nobody has reconstructed the why yet, and the reviewer is the first to try. As the paper puts it, review “wasn’t built to recover missing intent.” The encouraging part is that missing intent is recoverable: The reasoning existed; we just discarded it. Have the agent state what it was trying to do and what it ruled out, then capture it as a decision log on the PR, and a large part of the reconstruction cost disappears. This is a tooling problem, and tooling problems get solved.

None of which makes “have the AI review the AI” a complete answer on its own. A second model with different priors genuinely catches real bugs, and it catches a lo

[truncated for AI cost control]