How to Stop Babysitting AI Code
LLM agents make feature development cheap but introduce architectural drift. By separating architectural decisions from implementation and enforcing rules via build system checks, developers can reduce the burden of reviewing generated code and focus on system design.
LLM agents have made feature work feel almost cheap. You ask for a change, the agent writes code, adds tests and runs them. For a moment, the future looks like vibe coding.
Then the receipts arrive.
The agent imported from the nearest useful path and gave frontend code direct access to the database. That choice made the feature easier to implement, but it let short-term convenience rewrite the architecture.
The obvious next move is to write the rule down. AGENTS.md files, specs and planning documents all try to load architecture into the agent before it writes code.
They help, but prose is still a reminder, not a constraint. The agent can forget it, obey the part that is convenient or claim success without satisfying the rule.
Review became the backstop, which meant checking every diff for shortcuts like that. It helped, but it turned development into a tedious supervision job.
The way out was to separate architectural decisions from implementation details, then add checks that force the implementation to respect those decisions. Once that was in place, review shifted away from generated plumbing and back toward long-lived architectural choices like module boundaries, public APIs and dependencies.
Rules Need a Place to Live
The deeper problem is memory. Humans carry social context forward. Agents need that context reloaded or enforced every run. They can break the same written rule ten times and still treat the eleventh reminder as new information. It’s like hiring a very fast intern who resets every morning.
Documentation is useful context, but a poor home for rules the project cannot afford to break. Context is scarce. Instructions dilute each other. If a rule must hold every time, it should live somewhere the codebase can enforce it.
The Cost Curve Flipped
Software teams already use checks. Important rules leave prose behind and become types, tests, linters, build checks or runtime validation. Teams usually promoted a rule only when violations were common or enforcement was important enough to justify the work. Even then, they reached for standard lint presets and checks that were cheap to configure.
A custom checker that understands your module layout, your import boundaries and your idea of a public surface would have required significant upfront engineering effort for questionable reward. Standard lint presets and code reviews as a quality backstop were often the better tradeoff.
Coding agents changed the economics in both directions.
They generate drift faster than humans can review it.
They make bespoke checkers cheap enough to prototype in an afternoon.
The source of the drift now helps contain it.
Here is a small example. Agents love to wrap simple arguments in ceremony.
Instead of:
    load(id: string)
you get a tiny application form like:
    load(input: { id: string })
Telling the agent “don’t do that” in an instructions file works for a while. Asking the agent to write a check that rejects that pattern in public interfaces took me minutes. From then on, the check handled the reminder and the build failed until the agent fixed it.
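A minimal sketch of what such a check might look like, assuming a simple text scan is good enough for the project. The rule name `no-wrapped-arg`, the function `findWrappedArgs` and the regular expression are illustrative assumptions, not the check from this project. A sturdier version would likely walk the syntax tree instead of matching text.

```typescript
// Hypothetical rule: a function should not take a single parameter whose
// type is an inline object literal with exactly one property.
export type Violation = { line: number; message: string };

// Matches e.g. `load(input: { id: string })`.
// Captures: function name, parameter name, property name, property type.
const WRAPPED_ARG =
  /\b(\w+)\s*\(\s*(\w+)\s*:\s*\{\s*(\w+)\??\s*:\s*([^;,{}]+?)\s*;?\s*\}\s*\)/;

export function findWrappedArgs(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    // A fresh global regex per line keeps `lastIndex` state local.
    const pattern = new RegExp(WRAPPED_ARG.source, "g");
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const fn = match[1];
      const key = match[3];
      const type = match[4];
      violations.push({
        line: index + 1,
        message: `no-wrapped-arg: ${fn}() wraps a single argument; use ${fn}(${key}: ${type})`,
      });
    }
  });
  return violations;
}
```

Run over the public interface files, a non-empty result would fail the build. The message names the rule and suggests the flat signature, which gives the agent something concrete to repair.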
The pattern easily scales to rules that actually matter. Frontend must not import the database schema? That’s an import boundary check. Don’t want your validation logic to live in the UI? That’s a source check. Once a rule is a check, the failure is deterministic, the error message names the rule and the agent repairs its own violation before I ever see the diff.
Contracts Outside, Generated Code Inside
In my project, local rules become checks in the build system. They live outside the package source and are grouped by the code they govern, such as UI primitives, UI components and domain packages.
For this example, let’s look at the domain package. It is mostly plain TypeScript code that carries the business logic for my project. The logic is split into small modules with narrow public surfaces. Each module follows an enforced structure.
Figure: some-module/ source files split by review responsibility. Generated files (regenerate when authoritative files change): impl/, the implementation, containing agent-written code.
Review starts with the authoritative files. Generated code stays inside impl/**.
The structure is one rule among several. A module must contain the expected files and each file must obey the rules for its role.
The architectural decisions live in the authoritative files. manifest.json, README.md and contract/** define the module’s intent, public surface, dependencies and behavior examples.
The implementation details stay inside impl/**. They must satisfy the contract and stay inside the import boundaries.
Changes to authoritative files can change the project’s shape. They also require impl/ to be regenerated. Changes inside impl/ should stay local to the module. If they escape that boundary, a check fails.
The checks live in the build system, outside the module, where they enforce those roles.
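The structure rule could be sketched roughly as follows. The exact file list is an assumption pieced together from the file names mentioned in this article, and `checkModuleLayout` and the `module-layout` rule name are hypothetical. The function takes a list of paths rather than reading the disk, so the build system is free to decide how files are discovered.

```typescript
// Assumed layout: authoritative files at the module root and under contract/,
// generated code under impl/. Paths are relative to the module root.
const REQUIRED_FILES = [
  "manifest.json",
  "README.md",
  "contract/interface.ts",
  "contract/create.ts",
  "contract/cases.ts",
];

export function checkModuleLayout(moduleName: string, paths: string[]): string[] {
  const errors: string[] = [];
  const present = new Set(paths);

  // Every authoritative file must exist.
  REQUIRED_FILES.forEach((file) => {
    if (!present.has(file)) {
      errors.push(`module-layout: ${moduleName} is missing ${file}`);
    }
  });

  // Everything else must sit under contract/ or impl/ (an assumption).
  paths.forEach((path) => {
    const known =
      REQUIRED_FILES.indexOf(path) !== -1 ||
      path.startsWith("contract/") ||
      path.startsWith("impl/");
    if (!known) {
      errors.push(`module-layout: ${moduleName} has unexpected file ${path}`);
    }
  });

  return errors;
}
```

An empty result means the module has the expected shape. The per-file role rules would be separate checks layered on top.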
Spec System Checks
interface.ts: public API only, no implementation
create.ts: expose the module constructor; declare every required dependency
cases.ts: define behavior examples as tests; build the module through create.ts; pass against the real module with fake dependencies
impl/**: may import only create.ts and interface.ts; may not use ambient capabilities such as Date or crypto
The spec system checks live in the build system, outside some-module/.
The goal is to make impl/** a closed box. If the contract files are well specified, generated code can still be wrong, but the mess stays inside the implementation directory.
This makes it possible to keep review focused on the authoritative files, where the lasting decisions live.
I prefer regenerating impl/** to patching it. Patches preserve traces of the previous design. They also give the agent more chances to reconcile old assumptions with new ones. Regeneration works best when each module is small enough to rebuild in one pass.
Dependencies
Dependencies are design decisions. Generated implementation should not grant itself permission to call the network, touch storage, read the clock or mint IDs.
In my project, code in impl/** cannot call Date.now(), fetch, localStorage or crypto.randomUUID() directly. If a module needs time, network, storage or IDs, that design decision gets recorded in the contract by declaring the dependency as a constructor argument in contract/create.ts.
That moves review to the right question. I am no longer asking whether the agent used localStorage correctly. I am asking whether this module should have persistence at all.
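The ban on ambient capabilities could be approximated with a line-by-line scan, as in the sketch below. The patterns, the `ambient-capability` rule name and `findAmbientUse` are assumptions for illustration. A text scan like this will also flag matches inside comments and strings, so a syntax-aware version may be worth the extra effort.

```typescript
// The four capabilities named above; the regex patterns are an assumption.
const AMBIENT: { name: string; pattern: RegExp }[] = [
  { name: "Date.now()", pattern: /\bDate\.now\s*\(/ },
  // `deps.fetch(...)` is fine, so skip calls that follow a dot.
  { name: "fetch", pattern: /(^|[^.\w])fetch\s*\(/ },
  { name: "localStorage", pattern: /(^|[^.\w])localStorage\b/ },
  { name: "crypto.randomUUID()", pattern: /\bcrypto\.randomUUID\s*\(/ },
];

export function findAmbientUse(path: string, source: string): string[] {
  const errors: string[] = [];
  source.split("\n").forEach((text, index) => {
    AMBIENT.forEach(({ name, pattern }) => {
      if (pattern.test(text)) {
        errors.push(
          `ambient-capability ${path}:${index + 1} uses ${name}; declare it in contract/create.ts`
        );
      }
    });
  });
  return errors;
}
```

The message points at contract/create.ts on purpose: the repair is a contract change that a human reviews, not a local workaround inside impl/**.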
The same pattern works for module boundaries. If a module needs another module, that dependency has to be explicit too. When a generated file crosses that boundary, the failure should point back to the rule.
For example, an agent might make order/impl/price.ts reach into the catalog implementation.

    // order/impl/price.ts
    import type { OrderDeps } from "../contract/create";
    import type { PriceQuote, PriceRequest } from "../contract/interface";
    import { catalogStore } from "../../catalog/impl/store";

    export function priceOrder(_deps: OrderDeps, request: PriceRequest): PriceQuote {
      const item = catalogStore.lookup(request.sku);
      return { amount: item.price * request.quantity };
    }
The check fails in the terminal before the diff reaches review.

    $ repo check
    error import-boundary
    order/impl/price.ts imports ../../catalog/impl/store
    impl/** may only import: ../contract/create ../contract/interface
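A check that produces a failure in that shape might look roughly like this. It is a sketch under stated assumptions, not the checker behind `repo check`: `checkImplImports` is a hypothetical name, specifiers are compared as literal strings, dynamic `import()` calls are not detected, and imports of sibling files inside impl/ (specifiers starting with `./`) are assumed to be allowed.

```typescript
// The allowed specifiers, taken from the error message above.
const ALLOWED = ["../contract/create", "../contract/interface"];

export function checkImplImports(path: string, source: string): string[] {
  const errors: string[] = [];
  // Matches the specifier in `... from "x"` and in side-effect `import "x"`.
  const specifier = /(?:\bfrom|\bimport)\s*["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = specifier.exec(source)) !== null) {
    const target = match[1];
    // Assumption: "./x" stays inside impl/ as long as it never climbs with "..".
    const sibling = target.startsWith("./") && target.indexOf("..") === -1;
    if (ALLOWED.indexOf(target) !== -1 || sibling) continue;
    errors.push(
      [
        "error import-boundary",
        `${path} imports ${target}`,
        `impl/** may only import: ${ALLOWED.join(" ")}`,
      ].join("\n")
    );
  }
  return errors;
}
```

Because the rule is a plain allowlist, the error can state both the offending import and the complete set of legal ones, which leaves little room for the agent to guess.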
The fix is to make the catalog dependency part of the contract and pass it in from outside the module.

    // order/contract/create.ts
    import type { CatalogStore } from "../../catalog/contract/interface";
    import type { OrderService } from "./interface";

    export type OrderDeps = {
      catalog: CatalogStore;
    };

    export declare function createOrder(deps: OrderDeps): OrderService;
After that, impl/** uses deps.catalog instead of importing the catalog implementation directly.
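The repaired implementation might then look like the sketch below. To keep the example self-contained, the contract types are inlined instead of imported, and their shapes are assumptions inferred from the earlier snippet (`lookup` returning an object with a `price`, a request carrying `sku` and `quantity`).

```typescript
// In the real layout these would be imported from ../contract/create and
// ../contract/interface. Shapes are inferred, not taken from the project.
type CatalogStore = { lookup(sku: string): { price: number } };
type OrderDeps = { catalog: CatalogStore };
type PriceRequest = { sku: string; quantity: number };
type PriceQuote = { amount: number };

// order/impl/price.ts, now reaching the catalog only through deps.
export function priceOrder(deps: OrderDeps, request: PriceRequest): PriceQuote {
  const item = deps.catalog.lookup(request.sku);
  return { amount: item.price * request.quantity };
}

// The caller decides which catalog the module sees, here a fake one.
const fakeCatalog: CatalogStore = { lookup: () => ({ price: 250 }) };
const quote = priceOrder({ catalog: fakeCatalog }, { sku: "sku-1", quantity: 3 });
```

With the fake catalog above, `quote.amount` is 750. The pricing logic is unchanged; only the route to the catalog moved into the contract.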
Behavior
Types and method signatures define shape, but they leave behavior open to interpretation. Contract cases close that gap with executable examples. In practice, those checks are ordinary unit tests.
By the time contract/cases.ts is written, the public interface already exists in interface.ts. Every dependency, including ambient capabilities such as time, storage and IDs, must be declared in contract/create.ts. That means the cases can be written before impl/** exists. They fail first, then guide generation.
Mocking also gets simpler. The test does not patch clocks, storage or network calls hiding inside the implementation. It passes those capabilities through the constructor, the same way production code does.
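As an illustration, a contract case for a module that needs time and IDs could look like this. The `notes` module, its types and the stand-in implementation are all hypothetical, and the case uses a plain thrown error where a project would normally use its test runner's assertions.

```typescript
// Hypothetical contract: the module declares time and ID generation as dependencies.
type NotesDeps = { now: () => number; newId: () => string };
type Note = { id: string; createdAt: number; text: string };
type NotesService = { add(text: string): Note };

// Stand-in for generated impl/** code, included only so the case can run.
function createNotes(deps: NotesDeps): NotesService {
  return {
    add: (text) => ({ id: deps.newId(), createdAt: deps.now(), text }),
  };
}

// A contract case: fakes go in through the constructor and nothing is patched.
export function caseAddStampsNote(): Note {
  const notes = createNotes({
    now: () => 1700000000000,
    newId: () => "note-1",
  });
  const note = notes.add("hello");
  if (note.id !== "note-1" || note.createdAt !== 1700000000000) {
    throw new Error("add() must take its ID and timestamp from the declared dependencies");
  }
  return note;
}
```

Because the clock and the ID source are ordinary arguments, the expected values are fixed and the case stays deterministic without any global mocking.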
If the implementation changes, the cases keep the contract’s meaning pinned down. If the meaning has to change, the cases change first.
Reviewing
Review moves closer to the decisions that deserve human attention. The structure makes high-impact changes hard to hide because the public shape, dependencies and behavior examples all live in the files I expect to review.
That changes the questions I ask. Does this module have one purpose? Are the dependencies earned? Will the public shape age well? Do the examples cover the important behavior?
Inside impl/**, the agent can build a maze of adapters or name every variable twice. If the contract holds and the boundaries hold, I do not have to care.
Limits
Even with agents helping, some rules are still too difficult and expensive to encode well. A check that becomes complex, brittle or full of exceptions can create more maintenance work than it saves. The rules worth enforcing are cheap to run, easy to understand and stable enough to apply every time.
The checks need careful design and useful error messages. If the error is vague, the agent will route around the check. That can turn enforcement into whack-a-mole. Good checks need a crisp rule, a precise message and an escape hatch for cases that deserve human review.
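One possible shape for an escape hatch is a list of waivers kept in a reviewed file, sketched below. The `Waiver` type, `applyWaivers` and the idea of requiring a written reason are assumptions for illustration, not a description of this project's checks.

```typescript
type Violation = { rule: string; path: string; message: string };
// A waiver exempts one rule for one file and must say why.
type Waiver = { rule: string; path: string; reason: string };

export function applyWaivers(
  violations: Violation[],
  waivers: Waiver[]
): { failing: Violation[]; unused: Waiver[] } {
  const matches = (w: Waiver, v: Violation) => w.rule === v.rule && w.path === v.path;

  // A violation still fails unless a waiver with a non-empty reason covers it.
  const failing = violations.filter(
    (v) => !waivers.some((w) => matches(w, v) && w.reason.trim() !== "")
  );

  // Waivers that no longer match anything are reported so they can be removed.
  const unused = waivers.filter((w) => !violations.some((v) => matches(w, v)));

  return { failing, unused };
}
```

If the waiver list lives with the authoritative files, every exception shows up in the place review already looks, and stale exceptions surface instead of piling up.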
Conclusion
Code generation has made writing software cheaper. Reviewing generated code is becoming the bottleneck. If an engineer’s attention is the scarce resource, it should be spent on the decisions that shape the system.
The enforcement layer is a bet on repeated work. The upfront cost only makes sense when agents return to the same boundaries often enough to pay it back in review attention.
This approach has removed much of the tedium of reviewing agent-generated code. It puts my effort back into engineering the system by shaping the contracts, choosing the boundaries and deciding which parts should be allowed to depend on each other. The work feels less like babysitting and more like steering.