2026-06-04 07:31 UTC. Updated: 2026-06-30 13:03 UTC

Where AI agents pay off

This article explores the real-world return on investment from AI agents, particularly for individuals and small teams. The author argues that leverage comes from parallelizing bounded execution loops with tight feedback loops, not from replacing humans. Key insights include the importance of system design (model, harness, tools, environment, evaluator), the value of manual testing, and the advantage small teams have due to short feedback cycles. The article warns of the 'Sloptember' failure mode where agents increase the volume of mediocre work without improving quality.

Source: Hacker News AI. Author: ricokahler

Where AI Agents Actually Pay Off

Posted June 4, 2026

I am starting to get real leverage from AI agents.

Not theoretical leverage. Not "look, the chatbot wrote a function" leverage. I mean the kind where a messy voice note turns into a draft, a repo change, a test, a pull request, a live fix, a follow-up task, and a breadcrumb that gives the next agent more context.

That leverage is exciting. It is also a little cursed.

The cursed part is not that the models are secretly alive or that software engineers are all immediately obsolete. The cursed part is more boring and more important: the economics are starting to work in weird places, especially for individuals and very small teams, and they do not work everywhere. The window is small. The workflow changes are nontrivial. The token bill can get gross fast. And if you do not build the surrounding system, agents can easily become an expensive way to generate unfinishedness.

This is where I think most agent discourse gets a little too smooth. People ask "is AI faster?" as if there is one answer.

There is not.

Sometimes it is slower. Sometimes the model churns. Sometimes the first answer is plausible but wrong. Sometimes the agent burns twenty minutes going in the wrong direction.

But the interesting question is not whether one agent is always faster than one human on one task. The interesting question is:

What happens when a human can specify, run, review, and improve many bounded execution loops in parallel?

That is where the ROI starts showing up.

It is also where the danger starts showing up.

George Hotz wrote the sharp negative version of this in "The Eternal Sloptember". His argument, as I read it, is not just "AI code bad." It is that agent output frontloads the impressive part, leaves the hard polish and coherence work to the human, and produces artifacts that are broken in ways old quality proxies do not catch anymore.

I do not fully buy the permanent claim that agents cannot program. I do buy the organizational warning. If your feedback loops are slow and your average worker is not carefully reading and error-correcting the output, agents can raise the volume of mediocre work faster than they raise the quality of good work.

That distinction matters. The question is not "agents: yes or no?" The question is "who can absorb the leverage without degrading their own system?"

The ROI Is A System Property

The useful unit is not "the model."

The useful unit is the whole system:

Capability = model × harness × tools × environment × evaluator
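One way to read the multiplication, as opposed to a sum, is that a missing factor cannot be compensated for by a stronger one. The sketch below is only an illustration of that reading: the `capability` helper and every score in it are made up for the example, not measurements of any real model or setup.

```python
from math import prod


def capability(factors: dict[str, float]) -> float:
    """Multiply per-factor scores in [0, 1]; any zero zeroes the product."""
    return prod(factors.values())


# Illustrative, made-up scores. They are not measurements.
strong_model_no_evaluator = {
    "model": 0.9, "harness": 0.8, "tools": 0.8,
    "environment": 0.7, "evaluator": 0.0,
}
modest_model_full_system = {
    "model": 0.6, "harness": 0.8, "tools": 0.8,
    "environment": 0.7, "evaluator": 0.8,
}

if __name__ == "__main__":
    print(capability(strong_model_no_evaluator))  # 0.0
    print(capability(modest_model_full_system) > 0)  # True
```

Under this toy scoring, the weaker model inside a complete system beats the stronger model with no way to verify its work.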

The model matters. Obviously. A stronger model listens better, repairs better, and survives ambiguity better. GPT-5.5, in particular, has felt like a genuinely good foundational engineering model in my current workflow. It is often good enough that I can hand it a real codebase, a weird constraint, and a fuzzy product taste problem, then get back something I can review instead of something I have to babysit from first principles.

The annoying wrinkle is that models are not good in one global way. Some cloud/chat models feel much better at one-shot apps, UX exploration, visual design, and frontend taste. Codex/GPT-5.5 feels more steerable for deep repo engineering, but it can be pretty rough by default on product polish. That is not a contradiction. It is routing. Different tasks want different model/harness/tool combinations.

But the model is not the product.

The harness matters. Can it read the repo? Can it run tests? Can it browse current docs? Can it keep a plan? Can it spawn parallel work safely? Can it preserve local changes it did not make? Can it say clearly when it is blocked?

The tools matter. A model with a terminal, browser, GitHub access, docs, image inspection, and a real test suite is a different creature from the same model in a textbox. Tool access changes the shape of cognition because the agent can externalize uncertainty into the world: read the file, run the command, inspect the screenshot, check the deployed page.

The environment matters. A legible repo is agent fuel. Good scripts are agent fuel. Clear boundaries are agent fuel. Stable design primitives, typed connectors, preview/apply workflows, and boring test commands are all forms of intelligence that do not live in the model weights.

And the evaluator matters most of all. A task becomes delegable when there is a way to tell whether it worked.

Typecheck. Test. Build. Screenshot. Read back the external system. Ask a human to review a tight diff. Run an eval. Compare against a rubric. Verify the live URL. Whatever. Without an evaluator, the agent is not really operating. It is describing completion instead of proving it.
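A minimal sketch of what "proving completion" can mean in practice: run a set of named checks and only call the task done when every one exits cleanly. The names `run_checks`, `is_done`, and `CheckResult` are hypothetical, and the demo commands are stand-ins so the sketch runs anywhere; the assumption is that you would replace them with your own typecheck, test, and build commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks(checks: dict[str, list[str]], timeout_s: float = 600) -> list[CheckResult]:
    """Run each command and record whether it exited with status 0."""
    results = []
    for name, cmd in checks.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
            results.append(CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr))
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing tool or a hang counts as a failure, not as "done".
            results.append(CheckResult(name, False, str(exc)))
    return results


def is_done(results: list[CheckResult]) -> bool:
    """Done means at least one check ran and all of them passed."""
    return bool(results) and all(r.passed for r in results)


# Stand-in commands (assumption): swap in your real typecheck/test/build.
demo_checks = {
    "passes": [sys.executable, "-c", "print('ok')"],
    "fails": [sys.executable, "-c", "import sys; sys.exit(1)"],
}

if __name__ == "__main__":
    for result in run_checks(demo_checks):
        print(result.name, "PASS" if result.passed else "FAIL")
```

Note that an empty list of checks is treated as not done. That is a deliberate choice in this sketch: no evaluator means no proof.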

Manual Testing Is Underrated

The best agent workflows I have found are not the most autonomous ones. They are the ones with the tightest feedback loops.

Manual testing is underrated here. So is manual tasking.

People sometimes treat manual intervention as failure, as if the agent only counts if it runs alone and returns with a perfect artifact. That is the wrong fantasy. The fastest path is often:

1. Ask for a bounded change.
2. Let the agent inspect, edit, and test.
3. Manually poke the thing.
4. Notice the failure.
5. Make the agent fix it.
6. Turn the failure into a durable guardrail.

The last step is the compounding step.

If I manually catch a bug and only fix that bug, I got one fix. If I catch a bug and then add a test, a lint rule, a PR gate, a repo instruction, a skill, or an eval, I changed the future working conditions. Every later agent now has a slightly narrower path to repeat the same mistake.

This sounds obvious, but it is the difference between "using AI" and building an agentic work system.

For example, in one repo I added a PR compliance pattern that is almost comically literal: repository skills contain attestation words, and the agent has to include the current words in the PR body to prove it read the relevant instructions. The CI gate checks the JSON. If the branch changes, the head SHA has to be updated. If the agent tries to hand-wave the process, the gate fails.

It is silly.

It works.

And that is the point. You do not need the model to become careful by default. You need the environment to make the desired behavior easier to do than to skip.
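For a sense of how little machinery a gate like the PR compliance check needs, here is a rough sketch. Everything specific in it is an assumption made for illustration: the article does not give the real PR-body format, so the `<!-- attestation {...} -->` marker, the `head_sha` and `words` keys, and the function name `check_pr_body` are all hypothetical.

```python
import json
import re

# Hypothetical marker format; the real gate's PR-body layout is not specified.
_BLOCK = re.compile(r"<!-- attestation\s*(\{.*?\})\s*-->", re.DOTALL)


def check_pr_body(pr_body: str, expected_words: dict[str, str], head_sha: str) -> list[str]:
    """Return a list of problems; an empty list means the gate passes."""
    match = _BLOCK.search(pr_body)
    if match is None:
        return ["no attestation block found in PR body"]
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError as exc:
        return [f"attestation block is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["attestation block must be a JSON object"]

    problems = []
    # The attested SHA must match the current branch head, so a new push
    # invalidates an old attestation.
    if data.get("head_sha") != head_sha:
        problems.append("head_sha is stale; update it after pushing new commits")
    words = data.get("words")
    if not isinstance(words, dict):
        words = {}
    # Each skill's current word must appear, proving the instructions were read.
    for skill, word in expected_words.items():
        if words.get(skill) != word:
            problems.append(f"missing or outdated attestation word for skill {skill!r}")
    return problems
```

In CI, a non-empty result would fail the job and print the problems, which gives the agent something concrete to fix.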

Parallelism Is Not Just Splitting A Project

In serial, agents are often not as magical as people want them to be.

If I sit and watch one agent do one thing, I still have to wait. I still have to review. I still have to catch drift. I still have to close the loop. Sometimes I could have done the task myself faster.

The return starts to make sense when the work can run in parallel.

But "parallel" means two different things.

The first is normal decomposition. Some goals are naturally splittable: add several model providers, support several import paths, fix a cluster of bounded bugs, smoke test several integrations, write the plan while another branch works on the primitive. In those cases, the move is to write the map, split the slices, give each slice a narrow success condition, and periodically re-ground in main.

This is where planning documents become more useful than they sound. A good plan is shared state. It tells future agents what exists, what is blocked, what should merge first, and what "done" actually means.
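A plan that works as shared state can be as small as a list of slices with a success condition, a status, and dependencies. The sketch below is one possible shape, not a description of the author's actual planning documents; the `Slice` type, its fields, and the slice names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Slice:
    name: str
    done_when: str                      # the narrow success condition
    status: str = "todo"                # "todo" | "running" | "merged"
    depends_on: list[str] = field(default_factory=list)


def ready_to_start(plan: list[Slice]) -> list[str]:
    """Slices still to do whose dependencies have all merged."""
    merged = {s.name for s in plan if s.status == "merged"}
    return [
        s.name for s in plan
        if s.status == "todo" and all(d in merged for d in s.depends_on)
    ]


def blocked(plan: list[Slice]) -> dict[str, list[str]]:
    """Map each waiting slice to the unmerged dependencies holding it up."""
    merged = {s.name for s in plan if s.status == "merged"}
    waiting = {}
    for s in plan:
        missing = [d for d in s.depends_on if d not in merged]
        if s.status == "todo" and missing:
            waiting[s.name] = missing
    return waiting


# Hypothetical example plan.
plan = [
    Slice("shared-adapter", "adapter tests pass on main", status="merged"),
    Slice("provider-a", "real request returns a completion",
          depends_on=["shared-adapter"]),
    Slice("smoke-tests", "smoke suite green against deployed build",
          depends_on=["provider-a"]),
]

if __name__ == "__main__":
    print(ready_to_start(plan))  # ['provider-a']
    print(blocked(plan))         # {'smoke-tests': ['provider-a']}
```

The point of writing it down this explicitly is that any agent, or the human, can answer "what should merge first" without rereading every branch.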

A model-provider push is a good example. The goal was not "have agents do provider stuff." The goal was to make additional providers usable, cheap enough to matter, and provable through the real product path. That split into capability research, shared adapter work, provider-specific implementation, smoke tests, and an integration pass that checked what was actually merged, deployed, and usable.

That last part matters. A branch can be merged and still not be done. A local smoke test can pass and still not mean the product works. Sometimes success has to mean a real production turn, through the real auth path, with enough output to prove the provider is not merely returning a polite error.
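A success condition like that can be made mechanical. The following is a heuristic sketch only: the `Turn` type, the 200-character threshold, and the marker phrases are all assumptions chosen for illustration, and a real check would be tuned to the provider and product in question.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    status_code: int
    text: str


# Hypothetical phrases that suggest a polite error rather than a real answer.
ERROR_MARKERS = ("not authorized", "invalid api key", "unable to process")


def looks_like_real_turn(turn: Turn, min_chars: int = 200) -> bool:
    """Heuristic: a real turn succeeds, has substance, and is not an apology."""
    if turn.status_code != 200:
        return False
    body = turn.text.strip()
    if len(body) < min_chars:
        # Too short to prove the provider produced real output.
        return False
    lowered = body.lower()
    return not any(marker in lowered for marker in ERROR_MARKERS)
```

A check like this is only meaningful when the `Turn` comes from the deployed product through the real auth path; run against a local mock, it proves much less.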

The second kind of parallelism is less clean and more honest: working on more than one thing because the agent is busy and I am sitting there.

I am literally dictating parts of this article while other agents are fixing other things. Some of those things are related. Some are not. I am doing it because I am bored waiting for loops to finish, because I am anxious, because the queue is there, and because if I care about maximizing my own output the incentive is obvious: keep useful work in flight.

That is not the same as one project neatly split into ten slices. It is a more ambient multiplexing of attention. While one thread builds, another reviews, a third researches docs, a fourth waits on CI, and I use the dead air to think about the next thing.

This changes the latency math. If I have one task running, the difference between 20 minutes and 40 minutes is painful. If I have several bounded loops running and my real bottleneck is review, merging, and deciding what to queue next, the difference matters less. Not zero. But less.
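The latency argument can be checked with a deliberately crude model: all loops start together, and the human reviews the results one at a time once the agents finish. The 20 and 40 minute figures come from the paragraph above; the six loops and the five minutes of review per loop are assumptions picked for the example.

```python
def wall_clock_minutes(n_loops: int, agent_minutes: float, review_minutes: float) -> float:
    """Total time when agents run in parallel and reviews happen one by one."""
    return agent_minutes + n_loops * review_minutes


def slowdown(n_loops: int, fast: float, slow: float, review_minutes: float) -> float:
    """How much longer the whole batch takes when each agent run is slower."""
    return (wall_clock_minutes(n_loops, slow, review_minutes)
            / wall_clock_minutes(n_loops, fast, review_minutes))


if __name__ == "__main__":
    # One loop: 25 minutes becomes 45 minutes.
    print(slowdown(1, fast=20, slow=40, review_minutes=5))  # 1.8
    # Six loops: 50 minutes becomes 70 minutes.
    print(slowdown(6, fast=20, slow=40, review_minutes=5))  # 1.4
```

Doubling agent latency costs 80% with a single loop but 40% with six in this model, because review time, not agent time, dominates the total. The model ignores context-switching cost, so it flatters parallelism somewhat.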

The job becomes orchestration: what is running, what is worth checking now, what can wait, what needs to be killed, what should become a primitive, and what should merge before another branch drifts.

That does not mean "start ten random branches and vibe." It means keeping enough explicit state that parallel work stays reviewable instead of dissolving into a pile of branches nobody clearly owns.

The Small-Team Ownership Window

This is the part I keep coming back to: agents may be a much better deal for a small number of high-agency people than for the average large org.

Large organizations have advantages: money, distribution, legal cover, procurement, internal data, and teams of specialists. But they also have slow feedback loops. The person prompting may not own the architecture. The person reviewing may not understand the product context. The person paying the token bill may not see the cleanup burden. The person measuring productivity may count output instead of coherence.

That is how you get the Sloptember failure mode: more code, more features, more artifacts, more surface area, and less understanding.

Small teams have a different advantage. The loop can be brutally short:

1. I feel a roadblock.
2. I decide whether the roadblock is recurring.
3. I build or ask an agent to build the primitive that removes it.
4. I manually test the new path.
5. I use the improved path immediately on the next task.

That loop is hard to buy with headcount.

It is also why the $200 tier is not just a pricing detail. For an individual or tiny shop, a heavy consumer subscription can feel like access to an absurd amount of subsidized frontier-ish execution. I can burn through most of a weekly Codex allowance, keep thinking, keep delegating, and keep building. Inside a big company, that same behavior may be blocked by policy, data rules, vendor approval, or simply the fact that the enterprise has to pay usage-based prices for every team.

So there is a weird temporary arbitrage here. Individuals can sometimes get something that looks like enterprise execution capacity before enterprises can comfortably operationalize it.

But it only works for a narrow class of people and teams. You need taste. You need error correction. You need enough technical depth to know when the agent is wrong even though the output sounds confident. You need enough product judgment to know when not to run another branch. You need enough executive function scaffolding to remember what is already running.

This is not "AI makes everyone 10x."

It is more like: AI lets some people build a little execution machine around their own judgment, if they are willing to do the practical work of making that machine reliable.

That matters personally because "just get a job" does not feel like the stable fallback it is supposed to be. I have applied for roles where I thought I was a real fit, gotten th
