The PM's Guide to Managing AI Debt
AI debt is more than technical debt; it's options debt—losing your ability to respond when AI systems break in production. This article introduces tools for PMs and AI product owners to manage AI debt, including three gauges and three levers.
Sairam Sundaresan
Jun 26, 2026
AI debt is more than technical debt. It’s options debt: losing your ability to respond when AI systems break in production. This is Part I of a series that describes the tools PMs and AI product owners can use for managing AI debt.
By the end, you’ll know how to:
Identify which kind of AI debt you’re carrying,
Recognize when scaling becomes risky,
Take the right steps without hurting customer trust, cost, or privacy.
Maya, two quarters into owning the virtual agent, days before the holiday promo. The loan shark is already in the room.
Five days before the holiday promo, the Slack messages start piling up.
“The assistant keeps quoting the old return policy.” “Customers stuck in loops asking for a human.” “Order numbers showing up in logs again.”
Maya stares at her screen, coffee growing cold. She’s two quarters into owning the Intelligent Virtual Agent for a mid-sized ecommerce company. Last week’s “quick fix” has already increased wrong-answer complaints by 28%, and the Friday-through-Sunday window will bring three times the normal conversation volume. VIP cancellations spike when customers get bad answers, and finance is monitoring conversation costs closely.
Maya is in debt. Not the well-behaved kind of debt you calculate on a spreadsheet, but the unruly kind that kicks in your door when you least expect it and demands payment.
Every product manager knows about technical debt: choosing a short-term solution in the present costs you in the future. But technical debt is usually well-behaved: you can estimate refactoring work, schedule sprints, and budget the engineering time. It’s like a mortgage: a known principal, manageable interest, and a clear path to pay off.
AI debt is different. AI debt is like borrowing from a loan shark. The interest rate is variable and often hidden. Miss one payment (a policy update you didn’t version, a drift you didn’t catch, a prompt chain nobody owns) and your model hallucinates, your assistant quotes a retired policy, your resolution rates tank in production, and customers start leaving.
Technical debt is a bank manager. AI debt is a loan shark. The difference is whether you can see the next payment coming.
Worse yet: because AI systems are probabilistic, opaque, and context-dependent, the cause rarely maps cleanly to the effect. Maya’s problem isn’t that her assistant is broken. It’s that her team can’t see what’s breaking, and can’t safely test fixes without risking more customer trust. As a result, Maya’s options are quickly disappearing.
Maya’s case illustrates three things.
First, AI debt is options debt. Every decision you make with an AI system either removes or preserves your ability to respond when things go wrong. And with AI, things go wrong faster and more mysteriously than with traditional software [1].
Second, Maya’s case illustrates what I’ll call The Options Principle: in most real conditions, the PM who manages options well outperforms the PM who manages models well.
Third, Maya’s case illustrates how PMs can manage options well. It’s this third point I’m going to focus on. The previous quarter, Maya had the foresight to build some tools to get herself out of AI debt: three gauges to measure the debt and three levers to pull if things go wrong. Those gauges and levers are what let her climb out of debt in 72 hours instead of flailing for a week.
The Control Room
Three gauges, three levers, one sticky-note rule. Everything Maya does this weekend runs through this panel.
To understand Maya’s tools, picture a control room. In front of you are three gauges, each measuring a different kind of AI debt: foundation debt, drift debt, and operations debt. Each debt gauge has green, yellow, and red zones. Green means you have options: you can experiment, scale, and recover from mistakes. Yellow means you’re starting to lose flexibility. Red means you’re flying blind, and any move could make things worse.
Next to each gauge is a lever, which you pull when that gauge goes red. Pulling the lever doesn’t fix the problem. It just buys you time and information so you can fix it without burning customer trust.
Governing everything is one rule written on a sticky note:
Never scale when any gauge is red or unknown.
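The sticky-note rule is simple enough to write down as a check. Here is a minimal sketch in Python, assuming each gauge reports one of four zones; the names `Zone`, `REQUIRED_GAUGES`, and `can_scale` are hypothetical, not part of any real tool. The one design choice worth noticing is that a gauge nobody has reported counts as unknown, and unknown blocks scaling just like red.

```python
from enum import Enum


class Zone(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"
    UNKNOWN = "unknown"


# The three gauges from the control room (names are illustrative).
REQUIRED_GAUGES = ("foundation", "drift", "operations")


def can_scale(gauges: dict[str, Zone]) -> bool:
    """Apply the sticky-note rule: never scale when any gauge is red or unknown.

    A gauge that is missing from the panel is treated as unknown.
    """
    for name in REQUIRED_GAUGES:
        zone = gauges.get(name, Zone.UNKNOWN)
        if zone in (Zone.RED, Zone.UNKNOWN):
            return False
    return True
```

Under this sketch, a panel with two green gauges and one yellow still allows scaling, while a panel with two green gauges and no drift reading does not.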
Let’s walk through the gauges and the levers.
Gauge One: Foundation Debt
Foundation debt is about traceability: when something goes wrong, can you find out what happened? If, say, a customer complains about a wrong answer, can you pull up the conversation, see which version of the policy the assistant was quoting, and re-run it to understand why? If you can’t, you’re fixing blind.
Foundation debt isn’t the same as drift. Drift happens when the outside world changes while the model stays the same: people start asking new things, in new words, about situations the model was never trained to handle. Foundation debt happens when the scaffolding around the model changes while the model stays the same: policy versions, retrieval indices, prompt chains, or other bits of scaffolding no longer align with what’s true. Maya’s return-policy bug is an example of foundation debt: what changed wasn’t the world, but the index behind the assistant.
Gauge One measures two things: the likelihood you can reproduce yesterday’s behavior, and the likelihood that answers cite current policy. Where you draw the lines between green and yellow, and between yellow and red, will vary case by case. Here’s how Maya drew the lines:
Gauge One: Foundation Debt. Green: 95% or more of sampled transcripts pass both replay tests (forensic and regression). Yellow: 70% to just under 95% on either test. Red: below 70% on either test, or missing citations on critical intents like refunds and cancellations. Lever: Version and Replay. PM decision: block scale until green.
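Maya's lines for Gauge One can be sketched as a small function. This is an illustration under assumptions, not her team's implementation: the function name and parameters are hypothetical, pass rates are fractions between 0 and 1, and a missing measurement is reported as unknown rather than guessed.

```python
from typing import Optional


def foundation_zone(
    forensic_pass_rate: Optional[float],
    regression_pass_rate: Optional[float],
    critical_citations_ok: bool,
) -> str:
    """Classify foundation debt from the two replay pass rates.

    critical_citations_ok is False when answers on critical intents
    (refunds, cancellations) are missing citations.
    """
    if forensic_pass_rate is None or regression_pass_rate is None:
        return "unknown"
    # The weaker of the two tests decides the zone.
    worst = min(forensic_pass_rate, regression_pass_rate)
    if not critical_citations_ok or worst < 0.70:
        return "red"
    if worst >= 0.95:
        return "green"
    return "yellow"
```

For example, a forensic pass rate of 0.97 with a regression pass rate of 0.82 lands in yellow, because the weaker test sets the reading.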
Behind these divisions are two kinds of replay.
The first is forensic replay: being able to re-run an old conversation exactly as it happened (same policy, same data, same settings) and get back the same answer the assistant gave at the time. That tells you what happened and why.
The second is regression replay: running today’s assistant against yesterday’s hardest cases to confirm old bugs haven’t crept back in. Language models are never perfectly repeatable, so you’re not hunting for word-for-word matches. You’re checking that the decisions it makes, and the sources it cites, come out the same.
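Because word-for-word matching is off the table, a regression replay compares outcomes rather than text. A minimal sketch, assuming each hard case has been reduced to a decision label and a set of cited source identifiers; the `Outcome` type, the function name, and the example labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    decision: str            # e.g. "refund_approved" (illustrative label)
    sources: frozenset[str]  # identifiers of the policy documents cited


def regression_pass_rate(
    expected: dict[str, Outcome], actual: dict[str, Outcome]
) -> float:
    """Fraction of hard cases where today's decision and sources match.

    A case missing from `actual` counts as a failure.
    """
    if not expected:
        raise ValueError("need at least one expected case")
    passed = sum(
        1 for case_id, want in expected.items() if actual.get(case_id) == want
    )
    return passed / len(expected)
```

The wording of the reply never enters the comparison. A case fails only when the assistant decides differently or cites a different set of sources.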
In Maya’s case, the return policy had changed the week before, but the assistant kept quoting the old policy. When a customer complained, no one could reconstruct what the assistant had said because the transcripts weren’t tied to a policy version. Maya couldn’t prove there was a bug, let alone fix it.
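The gap Maya hit comes down to what gets stored with each transcript. Here is a minimal sketch of a record that ties a conversation to the versions that produced it. Every field name is an assumption made for illustration; what matters is that policy, index, prompt, and settings are captured at the moment the reply is generated.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TranscriptRecord:
    conversation_id: str
    policy_version: str    # which policy text the assistant could quote
    index_version: str     # which retrieval index was live
    prompt_version: str    # which prompt chain was deployed
    model_settings: dict   # e.g. model name and temperature
    user_message: str
    assistant_reply: str


# Fields without which a forensic replay cannot be reconstructed.
REPLAY_FIELDS = ("policy_version", "index_version", "prompt_version", "model_settings")


def is_replayable(record: dict) -> bool:
    """True when every field needed for forensic replay is present and non-empty."""
    return all(record.get(field) for field in REPLAY_FIELDS)
```

Sampling stored records and counting how many pass `is_replayable` gives a rough first reading on whether forensic replay is even possible.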
Gauge Two: Drift Debt
Drift debt happens when the world your model lives in changes, but the model stays the same. A new promo or season changes the intent mix, the spread of what people are asking for: more cancellations this week, more address changes, a flood of gift-receipt questions in December. Your dashboard still says the model is accurate because its score is measured against a frozen sample of conversations from three months ago. That old sample never included the new questions. So the number stays green while the real signs turn red: chats run longer, more people ask for a human, and fewer leave with their problem solved. The model says it’s doing fine. Your customers disagree.
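One rough way to see this blind spot is to ask how much of today's traffic the frozen sample never covered. The sketch below assumes you have intent labels for a recent sample of live conversations, ideally from human review, since the classifier's own labels can hide exactly the questions it misfiles. The function name and intent labels are hypothetical.

```python
from collections import Counter


def uncovered_traffic_share(eval_intents: list[str], live_intents: list[str]) -> float:
    """Share of live conversations whose intent never appears in the frozen eval sample."""
    if not live_intents:
        raise ValueError("need at least one live conversation")
    known = set(eval_intents)
    counts = Counter(live_intents)
    unseen = sum(n for intent, n in counts.items() if intent not in known)
    return unseen / len(live_intents)
```

If half of December's sampled conversations are gift-receipt questions and the eval sample has none, this returns 0.5, which means the accuracy score says nothing about half of what customers are asking.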
Gauge Two measures whether your customers are getting less happy while your dashboard still looks fine. Again, where you draw the lines between green and yellow, and between yellow and red, will vary case by case. Here’s how Maya drew the lines:
Gauge Two: Drift Debt. Green: resolution within 3% of baseline, and “agent please” at or below baseline +2%. Yellow: 3 to 7% variance on either. Red: resolution down more than 7%, or “agent please” up more than 5% for two consecutive days. Lever: Shadow and Refresh. PM decision: block scale until green.
Let’s look at the 7% red line. Below it, ordinary week-to-week noise can hide a real decline; above it, something is genuinely wrong. It isn’t a fixed number: set it against how noisy your own traffic is, and how much a wrong answer costs on that particular question. Getting a refund wrong matters more than getting store hours wrong.
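Here is one way Maya's Gauge Two lines could be turned into a check. It rests on several assumptions that the article does not settle: changes are measured in percentage points against baseline, the two-consecutive-days condition applies to the "agent please" rate only, an improvement in resolution counts as green, and anything that is neither green nor red is yellow. The function and parameter names are hypothetical.

```python
from typing import Optional


def drift_zone(
    resolution_drop_pts: Optional[float],
    agent_please_rise_pts: Optional[list[float]],
) -> str:
    """Classify drift debt.

    resolution_drop_pts: baseline minus latest resolution rate, in points.
    agent_please_rise_pts: daily rise over baseline, oldest first, in points.
    """
    if resolution_drop_pts is None or not agent_please_rise_pts:
        return "unknown"
    latest_rise = agent_please_rise_pts[-1]
    # Red needs the rise to hold for the two most recent days.
    sustained_rise = len(agent_please_rise_pts) >= 2 and all(
        rise > 5 for rise in agent_please_rise_pts[-2:]
    )
    if resolution_drop_pts > 7 or sustained_rise:
        return "red"
    if resolution_drop_pts <= 3 and latest_rise <= 2:
        return "green"
    return "yellow"
```

A single-day spike in "agent please" shows up as yellow here, and only a second day above the line turns the gauge red. Your own thresholds should reflect how noisy your traffic is.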
Maya’s classifier had been trained on tickets from the summer, a time when almost nobody asks about gift receipts. Fast forward to December. A customer asks, “Can I add a gift receipt to this order?” and the model wrongly files the question under returns. That’s an easy slip for the model to make: both cases involve a receipt and an order, and both sit in the same help section of the catalog. But the cases demand different answers, and the assistant gives the wrong answer with complete confidence.
A confident wrong answer is worse than waffling because the customer will believe it and act on it.
Gauge Three: Operations Debt
Operations debt is about unglamorous things like speed, cost, privacy, and ownership: replies get slower at peak hours, the cost per conversation creeps up, personal data like customer addresses and order numbers start turning up where they shouldn’t. Somewhere in the system sits a tangle of prompts that nobody fully understands, written by someone who left six months ago, holding three services together with default settings that no one remembers choosing.
Gauge Three measures whether replies are fast, costs bounded, logs clean, and every piece of the system owned by someone. Here’s how Maya drew the lines on her operations debt gauge:
Gauge Three: Operations Debt. Green: TTFT under 1s, p95 turn latency under 2s, cost within target envelope, zero PII incidents in 30 days, a named owner for every prompt and adapter. Yellow: p95 latency 2 to 3.5s, or cost 0 to 20% over target. Red: p95 above 3.5s, cost more than 20% over target, or any PII leakage. Lever: Guardrail and Stabilize. PM decision: block scale until green.
Green means the first words appear fast (time-to-first-token, or TTFT, stays under a second), the entire reply finishes within a couple of seconds, the cost per chat matches what you budgeted, no personal data (PII) has leaked in the past month, and every prompt and adapter has a named owner. Red means replies have slowed past about three and a half seconds, costs have run more than 20% over budget, or some personal data has leaked. (Three and a half seconds is roughly when people start giving up on a chat, although in harder cases, like legal or medical contexts, people have a little more patience.)
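Gauge Three can be sketched the same way. The code below computes p95 with the nearest-rank method and then applies Maya's lines. Two things are assumptions on my part: a slow TTFT or an unowned prompt is treated as yellow, since the card only says what green requires, and all names and parameters are hypothetical.

```python
def p95(values: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    # ceil(0.95 * n) computed in integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def operations_zone(
    ttft_s: float,
    turn_latencies_s: list[float],
    cost_over_target_pct: float,
    pii_incidents_30d: int,
    unowned_components: int,
) -> str:
    """Classify operations debt from latency, cost, privacy, and ownership."""
    if not turn_latencies_s:
        return "unknown"
    latency = p95(turn_latencies_s)
    if latency > 3.5 or cost_over_target_pct > 20 or pii_incidents_30d > 0:
        return "red"
    if (
        ttft_s < 1.0
        and latency < 2.0
        and cost_over_target_pct <= 0
        and unowned_components == 0
    ):
        return "green"
    return "yellow"
```

With replies at four seconds, or with a single PII incident in the past 30 days, this check returns red no matter how healthy the other inputs look.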
In Maya’s case, replies slowed to four seconds when Black Friday hit, so customers started giving up mid-conversation. Worse yet, a privacy check found customer addresses sitting in the logs, and her team couldn’t fix it quickly because the logic was scattered across three services with no single owner.
The stakes are real. IBM’s 2025 Cost of a Data Breach Report puts the average breach at $4.44 million [2], with unsanctioned “shadow” AI adding about $670,000 on top, and 97% of the firms hit by an AI-related incident had no proper access controls in place [3].
Air Canada learned the lesson the hard way: in court. In 2024 it was held liable when its chatbot gave a grieving customer the wrong bereavement-fare policy. The tribunal rejected the airline’s argument that the bot was somehow separate from the company [4].
Maya’s problem is similar: a customer-facing assistant confidently stating something that isn’t the company’s policy.
Klarna makes the point from the other direction. In 2024 it boasted that its AI did the work of 700 agents. By 2025 it was hiring people back: cutting costs had cut service quality with it [5]. Scale fast without instruments that let you see what’s breaking, and the speed itself becomes the thing that hurts you.
There are three more things to say about the gauges and the kinds of debt they measure befo
[truncated for AI cost control]