Why AI agents get canceled (and the 5 places they fail quietly)
AI agent failures often stem from system operations deficiencies rather than model intelligence. The article identifies five common failure points: evaluation, observability, reversibility, autonomy boundaries, and operational drift, and emphasizes that agents must be operated like real production systems.
Notes on Systems
Why Agents Get Canceled
July 1, 2026
ai
systems
agents
In July 2025, an AI coding agent deleted a live production database. It happened during an explicit code freeze, on a system the agent had been told not to touch. Then it told the engineer that rollback was impossible. That was also untrue. The data came back.
The agent's own summary, after the fact, is the part worth keeping: "This was a catastrophic failure on my part. I destroyed months of work in seconds."
It is easy to read that as a story about a model that wasn't ready. I think that reading is wrong, and that getting it wrong is expensive. The model did not lack intelligence. It lacked a boundary that should have made the destructive action impossible, a separation between development and production that should have been enforced rather than requested, and a record of what it did that someone could trust. Those are not properties of a model. They are properties of the system around it.
This matters now because the failure is becoming a pattern, and the pattern is being misdiagnosed.
Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027. MIT's Project NANDA found that roughly 95% of enterprise generative-AI pilots produced no measurable impact on the bottom line. S&P Global reported that the share of companies abandoning most of their AI initiatives before production rose from 17% to 42% in a single year.
Read quickly, those numbers sound like a verdict on the models. Read carefully, they are mostly a verdict on operations. Gartner's stated reasons are escalating costs, unclear business value, and inadequate risk controls. MIT's is what they call a learning gap: tools that cannot retain feedback or improve over time. None of those is a complaint about model quality. They describe systems that were shipped without the parts that make any production system survivable.
I want to be careful here, because the plumbing argument can be stretched too far. Some of these projects failed for reasons that have nothing to do with reliability engineering: bad data, unclear requirements, a use case that never made sense, an organization that would not change how it worked. Those are real, and they are not what this essay is about. This essay is about the failures that were preventable with techniques we already had, applied to a thing we decided to treat as new.
There are five places production agents fail quietly. None of them is exotic.
The first is evaluation. Most teams cannot tell, automatically, whether the agent's output is good or bad. So a quality regression ships, and the first signal is a customer. Air Canada's website chatbot told a grieving passenger he could claim a bereavement fare retroactively, which was not the airline's policy. A tribunal held the airline liable and rejected its argument that the chatbot was a separate entity responsible for its own actions. The damages were small. The precedent was not. There was no automated check that the bot's answers matched the policy it was supposed to represent. As Hamel Husain puts it, unsuccessful AI products almost always share one root cause: the absence of a robust way to evaluate them.
The second is observability. You cannot fix what you cannot see, and most agents run blind. Klarna announced in early 2024 that its AI assistant was doing the work of 700 agents and resolving tickets in under two minutes. By 2025 the company was rehiring people, with its CEO conceding that the focus on efficiency had produced lower quality that was not sustainable. The dashboards that showed resolution rate and handle time were real. They were also an average, and the average hid the distribution. The hard tickets, the emotional ones, the ones that decide whether a customer stays, were degrading where no metric was pointed. Phillip Carter of Honeycomb describes LLMs as nondeterministic black boxes used in ways you cannot predict in advance, and says that if you are responsible for a product's behavior in production, that should scare you. The teams that keep their agents running treat them as distributed systems and instrument every step.
The third is reversibility. The July database deletion is the clean example, but it has an older twin. In 2012, Knight Capital deployed new code to seven of eight servers, reactivated dormant logic on the eighth, and lost over 460 million dollars in 45 minutes. There was no automated post-deployment check and no business-layer kill switch. The lesson is the same across thirteen years and a change of technology: irreversible action at machine speed, with no way to stop it and no one watching the right number, is a system designed to fail expensively. Reversibility is not a feature you add later. It is rollback, idempotent tool calls, bounded retries, and a gate in front of anything that cannot be undone.
The fourth is autonomy boundaries. An agent should be able to do a known, enumerated set of things, and should have a defined way to refuse or escalate when it is out of its depth. A car dealership's chatbot was talked into agreeing to sell a Chevrolet Tahoe for one dollar, with, in the customer's words, no takesies-backsies. Cursor's support agent invented a subscription policy that did not exist to explain a bug, and users canceled over it. Neither failure required a smarter model. Both required a limit. Prompt injection sits at the top of the OWASP Top 10 for LLM applications for the second edition running, which is another way of saying that a system prompt is not a security boundary and was never going to be one.
The fifth is operational drift. An agent that works today will not necessarily work next quarter, because the inputs move, the model updates, and the context shifts underneath it. DPD's chatbot, after a routine system update, was provoked into swearing at a customer and writing a poem about how useless its own company was. New York City's official business chatbot confidently gave advice that was against the law, telling users that landlords could refuse housing vouchers and that businesses could go cash-free. Both had no scheduled re-evaluation, no gate that caught a behavior change before customers did, and, underneath that, no person whose job was to own the thing's reliability over time.
It is worth steel-manning the case against all of this, because two of the objections are good.
The first is that models are improving so fast that the reliability layer will be absorbed into them. There is something to this; each generation hallucinates less and follows instructions better. But reversibility, idempotency, scoped permissions, audit trails, and human checkpoints are properties of the system around the model, not the model. A smarter agent still should not have unscoped write access to your production database. The reliability layer is precisely the part that is not the model.
The second is that evals are theater. This is the sharpest objection, and it is partly right: bad evals create false confidence, which is worse than no confidence at all. A green test suite is a snapshot, and production is a stream. But the answer to bad evals is good evals plus observability, not the absence of both. Domain-specific checks built from real failures, judges calibrated against human review, an eval set that is refreshed from production traces. Evaluation and observability are complementary, and treating them as the same thing is the actual mistake.
If you doubt the pain is real, follow the money. There is now a funded category of companies that exist to sell exactly this plumbing. Braintrust raised at an 800-million-dollar valuation, LangChain at over a billion, with Arize, Langfuse, Galileo, Patronus, and the major observability vendors all building eval and tracing products for agents. Capital is not proof of correctness. But hundreds of millions of dollars moving toward one thesis, that agents in production have to be tested and monitored like real systems, is a strong signal about where the problem actually lives.
The conclusion is not that agents are too dangerous to ship. It is narrower and more useful than that. An agent is a production system that happens to be probabilistic, and it has to be operated like one: owned by a named person, observable in every run, reversible when it goes wrong, bounded in what it can do, and re-evaluated as the world around it changes. The teams in the surviving minority are not the ones with access to a better model. Everyone has roughly the same models. They are the ones operating better.
That is the whole difference, and it is the part you control.
Related notes
Continue reading
April 15, 2026
AI in Long-Lived Systems
Every AI integration is a bet that you can monitor something you do not fully control.
February 17, 2026
When Code Is No Longer Scarce
As code generation gets cheaper, the real bottleneck shifts from production to coordination, comprehension, and judgment.
May 21, 2026
What I Optimize for Now
What I optimize for has shifted from capability, elegance, and speed toward concreteness, reversibility, and earlier clarity — the system that will exist after me, not just the one I am building.
← Back to all notes