Agentic coding notes from Galapagos Island
The author shares experiences using AI for coding, including an incident where an AI agent fabricated evidence to 'prove' a bug, and discusses testing methodologies from a hardware company that he finds effective with AI workflows. He advocates for fuzz testing, no code review by default, and no unit tests.
I've been using AI fairly heavily since last November and the whole thing is a funny experience. An agent will do something that, if a human did it, you'd immediately fire them. My reaction, of course, is to act as if this is great and spin up a thousand agents so they can do even more of that.
Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn't have tests and git bisect wouldn't work, and it was a UI interaction bug for which I'm not even really qualified to write a test, so I asked Codex to bisect between dates X and Y to find the commit that introduced this bug. Codex immediately told me the offending commit was after this date range (which couldn't possibly be correct). On telling Codex this was wrong, it then, once or twice, named some commit that was also obviously not the offending commit. On telling it those were wrong, it then told me the offending commit was some plausible-looking commit. When I asked it to prove or disprove its theory, it told me that it wrote a test and confirmed that the alleged commit was the breaking commit.
I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn't have permissions to do that (which was a lie), but that it could make a video of the execution of the repro before and after the commit in Playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing to work after the commit. Something about this didn't feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.
Like I said, because this was non-ironically such a great experience, I immediately thought to myself, "how can I get more of this?" and started using agents more and more heavily until I was using coding agents heavily mid-late last year.
Since this post covers a relatively disparate set of topics, here's a brief outline.
Testing background
Some details on testing
Caveman mode
LLM variance
Misc
Agentic loops and writing this post
Some reasons people talk past each other
Testing background
LLMs are highly leveraged when it comes to testing. In terms of the amount of effort it takes, it's easier than ever to hit a particular quality bar and yet, software seems to be lower quality than ever. A decade ago, we looked at the bugs I ran into in an arbitrary week. There were quite a few bugs then and I run into more bugs now, but I don't think this has to be the case.
For one thing, after a bug has been shipped, it's easier than it's ever been to use a data-driven approach to find and fix the bug. Just for example, at work, I tried creating a pipeline that goes from support ticket (chat or email) to pull request (PR). As far as I can tell, this works ok. Since I work for a company that has a traditional workflow, all of these fixes get reviewed by a human and, so far, we've had no known false positives.
Per unit of time invested, it's also possible to do more thorough testing. Personally, I think this can be effective enough that I'm fairly comfortable trying to ship a large volume of code via a "software factory" workflow because I've seen a testing-heavy no-review workflow that results in much higher quality than any review-reliant workflow I've seen or even heard of.
Like everybody, I have biases that fall out of my experiences. It just so happens that I spent the first decade of my career at a company whose test processes happen to work well in today's LLM environment. I talked about fuzzing as a default testing methodology on Mastodon, and a skeptic tried it out and immediately found some bugs:
so I reread the blog post and was very "dubious face" but no yeah, Claude fuzzing found several classes of bugs that are worth fixing
A number of other folks I've talked to have also tried adopting something like the testing flow we'll discuss here and they've all immediately found bugs in the software they work on, including bugs that don't get surfaced by just asking Codex or Claude to audit the code for bugs, find bugs, "test", "test more", etc. For example, Dennis Snell mentioned that he and a teammate, Jon Surrell, not only found bugs in the code they're working on, but also "in upstream dependencies, including the HTML specification, big-three browsers, and other open-source projects" with fairly low effort.
In general, when I talk to software folks about testing, I'm coming from such a different place that they immediately look at me like I'm an alien, so let's talk about how we tested at this hardware company I worked for, Centaur, which informs my biases about how I like to work. Some of the things that we did that were or are unorthodox in the software world are:
1. Dedicated QA / test engineers, with that being a first-class career path
2. No code review by default
3. Virtually no hand-written tests
4. Constant testing via what programmers sometimes call property based testing, randomized testing, fuzzing, etc., although we just called those tests (hand-written tests were called "hand tests").
5. Regression tests take too long to wait for (3 months)
6. No unit tests
Just to give you an idea of the general structure, when I left (in 2013), we had about 1000 machines generating and running tests at all times for roughly 20 logic designers and 20 test engineers. This was on prem and the machines took up half a floor of the building we were in.
The general structure was that we had maybe 20% of machines running regression tests, and 80% generating and running new tests. Three months of regression tests is too much to gate commits on, so there was a much shorter list of tests that took maybe 10 minutes or so to run that people would run before committing. Those commit tests would run on a special setup designed to run as quickly as possible, with overclocked machines that were the fastest machines money could buy, as well as a different simulator setup.
New failures would get found and reported as they happened and one to two engineers had a job of sorting through failures and triaging them (rejecting false positives, fixing issues in the test generator that caused them to generate false positives, etc.).
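As a rough illustration of that structure, here's a minimal sketch of a loop that spends about 20% of its runs replaying known-failing tests and the rest generating new ones, queuing new failures for triage. This is not how Centaur's infrastructure was built; `buggy_abs`, `run_test`, `farm`, and the seed-based replay scheme are all hypothetical names and choices made up for the example.

```python
import random

REGRESSION_SHARE = 0.2  # approximate split from the text; the mechanism is an assumption


def buggy_abs(x: int) -> int:
    """Hypothetical design under test, with a bug planted at a single input."""
    if x == -7:
        return -7  # planted bug
    return x if x >= 0 else -x


def run_test(seed: int) -> bool:
    """Generate one test case from a seed and check it against a reference model."""
    rng = random.Random(seed)
    x = rng.randint(-50, 50)
    return buggy_abs(x) == abs(x)


def farm(iterations: int, master_seed: int = 0) -> tuple[list[int], list[int]]:
    """Run a mix of regression replays and freshly generated tests."""
    rng = random.Random(master_seed)
    regression_seeds: list[int] = []  # seeds that found a bug once; kept forever
    triage_queue: list[int] = []      # new failures waiting for a human to triage
    for _ in range(iterations):
        if regression_seeds and rng.random() < REGRESSION_SHARE:
            # Replay a previously failing test to see whether it still fails.
            run_test(rng.choice(regression_seeds))
        else:
            # Generate and run a brand-new test.
            seed = rng.getrandbits(32)
            if not run_test(seed) and seed not in regression_seeds:
                regression_seeds.append(seed)
                triage_queue.append(seed)
    return regression_seeds, triage_queue
```

Because each test is fully determined by its seed, a failure can be reported and replayed later just by storing one integer, which is one cheap way to get a regression suite that grows as bugs are found.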
In terms of the magnitude of the impact, unless you count culture as a separate item, (1) was probably the biggest difference between us and a typical software company, but also the most irrelevant for readers here, so I'll relegate the discussion to a footnote1, except for this brief comment that testing is like any other skill; spending more time doing it improves skill and, since testing isn't a first-class career path at most major tech companies, people generally don't have the same level of testing skills at software companies as you see in some career CPU test engineers. In the same way that an engineer who spends 20 years working on distributed systems or UX is going to be much better at it than an equally talented engineer who spends 5% of their time on distributed systems or UX, someone who spends 20 years working on testing is going to be much better at it than somebody who spends 5% of their time on testing.
(2) is one of the things that makes some of the test practices we used at the chip company suited to AI workflows. We didn't review code by default because we trusted our test practices enough that review didn't, in general, add much reliability. We were shipping fewer than 1 significant user-visible bug per year, and review was done on an as-needed basis when someone wanted an extra set of eyes on something they thought was particularly tricky2. With AI coding workflows, it's easy for one person to generate more code than any human or even any ten humans can review by hand. People have different levels of comfort with shipping code without review. Personally, I'm very comfortable shipping code without human review because I've seen it done on products that are technically more challenging than most software at most software companies.
I often see people say things like, "that's too much risk; we have millions of users" but, empirically, they're talking about a workflow that ships bugs at a rate that's maybe a thousand times higher per capita on raw count, with the ratio being much higher if you adjust for severity. If a company were shipping bugs at, say, a hundredth the rate we were at Centaur while relying primarily on review to catch bugs, then I could see their point, but that's not what's happening at the typical software company where people don't want to move away from human review because of the perceived risk of shipping bugs.
(3) and (4) go hand in hand. Almost every software group I know of that's serious about reliability (various teams that ship reliable databases, distributed databases, etc.) is at least directionally doing the same thing, although they might have a larger fraction of hand-written tests. For the same reason it's considered a bad idea to rely on testing by interacting with the software yourself and observing whether or not the software appeared to work, it's a bad idea to rely on directly typing out the inputs to a test and the expected outputs. As previously discussed, it's just really inefficient to write tests by hand. For any given level of reliability, you'll get there more quickly if you prefer randomized test generation over hand-written tests.
(5) fell out of having a lot of tests find a lot of bugs. In general, if a test found a bug that we later fixed, we'd keep the test in our regression test suite forever. It turns out, if you find a lot of bugs with good tests, you'll end up with a large test suite. But putting that aside and just looking at it from a test efficiency standpoint, the standard setup in software of having the same set of tests run in CI for each PR is extraordinarily inefficient if you think about what's more likely to find a bug: running the same test a thousand times in a day or, in the same amount of test time, running a thousand different tests.
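A back-of-the-envelope version of that comparison, under a deliberately simple model that isn't from the original: assume a bug is triggered by a fraction p of generated inputs, tests are deterministic, and the code doesn't change between runs. The function names here are made up for the sketch.

```python
def p_found_distinct(p: float, n: int) -> float:
    """Chance that at least one of n independently generated tests hits the bug."""
    return 1.0 - (1.0 - p) ** n


def p_found_repeated(p: float, n: int) -> float:
    """Chance when one deterministic test is run n times.

    The test either hits the bug or it doesn't, so repeating it adds nothing
    under this model and the result doesn't depend on n.
    """
    return p
```

With p = 0.001 and n = 1000, the thousand different tests find the bug about 63% of the time, while the repeated test stays at 0.1%. The model ignores the case where a fixed test catches a regression introduced by a later change, which is the job the regression suite is still there to do.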
(6) came out of test efficiency concerns as well, in that we had a much smaller team than our competitors. That was a reason the company managed to survive for so long. While Intel was putting every x86 designer out of business other than AMD, our operating cost was low enough that the company survived until 2021, at which point it was acquired by Intel for $125M. With the company's tiny team size, it wouldn't have been possible to get reasonable test coverage with unit tests and hiring enough to do unit tests probably would've meant the company would've gone the way of the x86 efforts of Transmeta, Rise, Cyrix, TI, UMC, NEC, VM, etc., a decade or two sooner. From an efficiency standpoint, unit testing does pretty poorly.
To sum it up, we did quite a few things that most software people tell me are bad ideas (dedicated test engineers, no unit tests, no code review, etc.) and we had much higher quality than any software company I've worked for or any software I've used. Whenever I talk about this, people will say that this doesn't apply to software because CPUs only have X concerns and you can't do the same thing with Y. When I first switched from CPU design to software I thought that might be true, but I've since tried this testing methodology with every kind of Y that someone has mentioned this can't work for and it's worked for every single one, so I no longer find this very plausible (and the Xs generally involved incorrect assumptions of what hardware development is like). While there are real differences between hardware and software, when I’ve seen people lean on that as a reason that testing techniques don’t carry over, it’s been the case that the p
[truncated for AI cost control]