RFC: Stopping runaway AI agent spend with atomic budget reservations
This RFC proposes a real-time budget decision plane for AI agent runs, using atomic reservations and per-run ceilings to prevent uncontrolled spending, with machine-readable state for agent adaptation.
RFC: A Real-Time Budget Decision Plane for AI Agent Runs
Status: Draft v3, for feedback · Author: Ajay Rajput · Date: July 2026
Audience: Platform/infra engineers running LLM gateways in production
1. Problem
AI agents don't consume tokens the way chat does. An agent runs a loop (observe, think, act, repeat), and each iteration resends the accumulated context, so per-call input grows with every step. By step 20 of a run with file reads, a single call can exceed 50K input tokens. Reported cases from the past year include a developer hitting $4,200 in API fees over one weekend of autonomous refactoring and a 35-engineer team receiving an $87K monthly bill; one audit of 30 teams found a 20x spread between p10 and p90 per-developer cost for the same tooling. These figures come from industry writeups rather than primary incident reports, but the mechanism they describe, unbounded loops resending growing context, is structural and reproducible.
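The growth mechanism can be made concrete with a little arithmetic. The sketch below is illustrative only: the starting context size and the tokens added per step are assumed numbers, not measurements, and `run_input_tokens` is a hypothetical helper.

```python
from typing import List


def run_input_tokens(steps: int, base_context: int, added_per_step: int) -> List[int]:
    """Input tokens sent on each call when every step resends the full context."""
    per_call = []
    context = base_context
    for _ in range(steps):
        per_call.append(context)   # the whole accumulated context is billed again
        context += added_per_step  # tool output and observations grow the context
    return per_call


# Assumed numbers: 3,000-token starting context, 2,500 tokens added per step.
calls_20 = run_input_tokens(steps=20, base_context=3_000, added_per_step=2_500)
calls_40 = run_input_tokens(steps=40, base_context=3_000, added_per_step=2_500)

print(calls_20[-1])   # size of the 20th call
print(sum(calls_20))  # total input tokens billed over 20 steps
print(sum(calls_40))  # total input tokens billed over 40 steps
```

Under these assumptions the 20th call alone carries 50,500 input tokens, and doubling the run from 20 to 40 steps nearly quadruples total input tokens (535,000 to 2,070,000), because cumulative cost grows roughly with the square of the step count.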
Three gaps make this hard to control today:
Budgets attach to the wrong unit. Existing gateway budgets attach to API keys, users, or teams, over accounting periods measured in days or months. The damage unit for agents is the run: one autonomous session that needs a ceiling in dollars, not a monthly quota it can exhaust in an hour. No mainstream gateway enforces per-run ceilings.
Budget enforcement is implicit and fragile in incumbent gateways. Recent budget-enforcement regressions in LiteLLM (e.g. #26672, #27381, #27480, with new budget issues continuing to appear) illustrate the underlying design problem: enforcement implemented as scattered callbacks with no explicit authorization step is hard to test and easy to silently break. Separately, models with missing price metadata have been treated as free, bypassing all budget checks, and team-level enforcement sits behind an enterprise paywall. The lesson is not "gateway X is buggy"; it is that budget authority should be an explicit, testable decision point with stated guarantees.
Enforcement is blind, so agents can't adapt. When a budget check fails today, the request dies with an opaque error. The agent never learns it was approaching a limit, so it can't do what a cost-conscious human does: downshift to a cheaper model for routine steps, narrow context, or wrap up. Visibility without a feedback channel produces bill shock; blocking without one produces broken runs. Neither changes agent behavior.
2. Thesis
AI agent spend control needs a real-time budget decision plane. This RFC defines a run-scoped budget authority that atomically reserves estimated spend before provider calls, reconciles against actual usage after calls, fails closed on unknown prices, and exposes machine-readable budget state so agents can adapt before they exhaust the run.
3. Goals
Per-run budget ceilings enforced before the provider call, with stated correctness guarantees under concurrency.
A machine-readable budget-state protocol (response headers plus RFC 9457 problem-detail errors) that lets agents adapt mid-run instead of failing blind, without mutating successful provider responses.
Per-run cost attribution rolling up to user, feature, and team, without depending on provider-side billing tags.
Fail-closed pricing: a model with no known price is unroutable unless an explicit tenant override exists.
Non-goals
Not a model gateway or provider abstraction. This is a budget-decision plane that can be embedded as a gateway hook, sidecar, or SDK middleware. Successful provider responses pass through unmodified.
Not agent-efficiency tooling (context compaction, caching, sub-agents). The frameworks and labs own that layer.
Not post-hoc cost dashboards alone. Attribution exists here to make enforcement trustworthy.
4. Design
4.1 Concepts
Run: one agent session. See identity rules in 4.2.
Ceiling: a USD limit attached to a scope (run, user, team, key, feature tag).
Budget Decision: the central primitive, the pre-call authorization result produced by the authority. Its value is one of allow, downgrade, advisory_warn, or block; a reservation is the internal action backing allow and downgrade, not a client-facing decision value. Every decision has an ID and is logged with its inputs (scopes, estimates, effective output cap, price table version).
Reservation: an atomic hold of estimated cost against one or more scopes, made before forwarding, committed or released after.
Ledger (per scope): committed_usd, reserved_usd, available_usd = limit_usd − committed_usd − reserved_usd, plus reservation records (reservation_id, expires_at, price_table_version).
Estimate: pre-call cost projection from the price table and token counts. In hard-gate mode the reservation basis is worst-case: actual input tokens plus effective_max_output_tokens at output price (see 4.4).
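The ledger and estimate definitions above can be sketched as a small data structure. This is a minimal illustration, not the RFC's implementation: `ScopeLedger` and `worst_case_estimate_micro_usd` are hypothetical names, amounts are integer micro-USD (per 4.6), and prices are assumed to be expressed in micro-USD per million tokens.

```python
from dataclasses import dataclass


@dataclass
class ScopeLedger:
    """Per-scope ledger; all amounts are integer micro-USD."""
    limit_micro_usd: int
    committed_micro_usd: int = 0
    reserved_micro_usd: int = 0

    @property
    def available_micro_usd(self) -> int:
        # available = limit - committed - reserved
        return self.limit_micro_usd - self.committed_micro_usd - self.reserved_micro_usd


def worst_case_estimate_micro_usd(
    input_tokens: int,
    effective_max_output_tokens: int,
    input_price_per_mtok: int,   # assumed unit: micro-USD per 1M input tokens
    output_price_per_mtok: int,  # assumed unit: micro-USD per 1M output tokens
) -> int:
    """Hard-gate basis: actual input tokens plus the full output cap at output price."""
    numerator = (
        input_tokens * input_price_per_mtok
        + effective_max_output_tokens * output_price_per_mtok
    )
    # Ceiling division so rounding can never under-reserve.
    return -(-numerator // 1_000_000)


run = ScopeLedger(limit_micro_usd=5_000_000, committed_micro_usd=4_910_000)
estimate = worst_case_estimate_micro_usd(50_000, 4_096, 3_000_000, 15_000_000)
print(run.available_micro_usd, estimate, estimate <= run.available_micro_usd)
```

With a $5.00 limit and $4.91 committed, only 90,000 micro-USD remain, so this worst-case estimate does not fit the run scope.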
4.2 Run identity
Client-supplied run IDs are convenient and untrustworthy. Rules:
X-Run-Id is accepted only from authenticated callers and is bound server-side to the authenticated key/user/team. A run ID cannot be attached to a different principal than the one that created it.
Absent a run ID, the authority issues a server-side run ID and returns it in response headers.
All ledger writes bind the full tuple: run_id + user_id + key_id + team_id + feature_id.
Cardinality controls apply (see 4.9): max active runs per principal, run TTL.
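The binding rules above can be sketched as a small registry. This is an in-memory illustration under stated assumptions: `Principal`, `RunRegistry`, and the `run_` prefix on server-issued IDs are hypothetical, feature_id is left out of the principal for brevity, and run TTL expiry is omitted.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Principal:
    user_id: str
    key_id: str
    team_id: str


class RunRegistry:
    """Binds each run ID to the authenticated principal that created it."""

    def __init__(self, max_active_runs: int = 100) -> None:
        self._owners: Dict[str, Principal] = {}
        self._max_active_runs = max_active_runs

    def resolve(self, principal: Principal, client_run_id: Optional[str] = None) -> str:
        if client_run_id is not None:
            owner = self._owners.get(client_run_id)
            if owner is not None:
                if owner != principal:
                    # A run ID cannot move to a different principal.
                    raise PermissionError("run ID is bound to another principal")
                return client_run_id
            run_id = client_run_id
        else:
            # No client-supplied ID: issue one server-side.
            run_id = "run_" + secrets.token_hex(8)

        active = sum(1 for p in self._owners.values() if p == principal)
        if active >= self._max_active_runs:
            raise RuntimeError("max active runs per principal exceeded")
        self._owners[run_id] = principal
        return run_id
```

A production version would also expire runs after their TTL so that the active-run count reflects live sessions only.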
4.3 Decision flow (reserve → commit → refund)
request
  → resolve scopes (run, user, team, key, feature)
  → compute effective_max_output_tokens (4.4)
  → estimate cost of requested model (worst-case basis in hard_gate)
  → ATOMIC: reserve estimate against ALL applicable scopes, or fail
      ├─ all scopes fit → forward request
      ├─ blocked, valid alternative → downgrade (policy-controlled) or block with alternatives
      └─ blocked, no alternative → block with problem-detail error
  → on provider success: commit actual cost, release unused reserve
  → on provider failure: release reserve
  → on missing result: reservation expires after TTL, reconciled asynchronously
  → attach budget-state headers to every response
Reservation across multiple scopes is a single atomic transaction (one Redis Lua script or one SQL transaction): all scopes reserve or none do. Sequential per-scope locking is explicitly rejected (partial reservations, deadlocks). A request may fit the run ceiling and still be blocked by the user ceiling; the decision reports the blocking scope, but the transaction touches all scopes.
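The all-or-nothing property can be illustrated with an in-memory stand-in. In this sketch a single `threading.Lock` plays the role that one Redis Lua script or one SQL transaction plays in the design; `InMemoryLedger` and `reserve_all` are hypothetical names, and nothing here is meant as a production ledger.

```python
import threading
from typing import Dict, List, Optional, Tuple


class InMemoryLedger:
    """Toy multi-scope ledger; amounts are integer micro-USD."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._lock = threading.Lock()
        self.limits = dict(limits)
        self.committed = {scope: 0 for scope in limits}
        self.reserved = {scope: 0 for scope in limits}

    def available(self, scope: str) -> int:
        return self.limits[scope] - self.committed[scope] - self.reserved[scope]

    def reserve_all(self, scopes: List[str], estimate: int) -> Tuple[bool, Optional[str]]:
        """Reserve against every scope or none; returns (ok, blocking_scope)."""
        with self._lock:
            # Phase 1: check every scope before touching any of them.
            for scope in scopes:
                if self.available(scope) < estimate:
                    return False, scope
            # Phase 2: all scopes fit, so apply every hold inside the same critical section.
            for scope in scopes:
                self.reserved[scope] += estimate
            return True, None


ledger = InMemoryLedger({"run": 5_000_000, "user": 100_000})
print(ledger.reserve_all(["run", "user"], 200_000))  # fits run, blocked by user
print(ledger.reserve_all(["run", "user"], 50_000))   # fits both
```

The first call fits the run ceiling but is blocked by the user ceiling, and because the check and the write share one critical section, the run scope is left untouched.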
Downgrade semantics. If downgrade is selected, the authority reserves against the selected alternative model, not the originally requested model. Auto-downgrade is allowed only when the alternative satisfies capability contracts (4.8) and tenant policy permits downgrade for the request class.
Reservation state machine. Idempotency is defined over explicit states:
reserved → forwarded → committed
reserved → released
reserved → expired → reconciled
All transitions are idempotent. A commit for an already-committed reservation is ignored. A release after commit is ignored. A retry carrying the same idempotency key returns the existing decision rather than creating a new reservation.
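One way to encode these idempotency rules is a pure transition function. The sketch below includes only the edges listed above and rejects any other edge; treating committed, released, and reconciled as terminal states that ignore later events is an assumption of this sketch, and `apply_transition` is a hypothetical name.

```python
# Edges taken directly from the three listed paths.
ALLOWED = {
    ("reserved", "forwarded"),
    ("forwarded", "committed"),
    ("reserved", "released"),
    ("reserved", "expired"),
    ("expired", "reconciled"),
}

# Assumption: once a reservation reaches one of these states, later events are ignored.
TERMINAL = {"committed", "released", "reconciled"}


def apply_transition(current: str, target: str) -> str:
    """Return the state after attempting current -> target, idempotently."""
    if current == target or current in TERMINAL:
        # Repeated commit, release after commit, and similar replays are no-ops.
        return current
    if (current, target) in ALLOWED:
        return target
    raise ValueError(f"illegal transition: {current} -> {target}")


state = "reserved"
for event in ["forwarded", "committed", "committed", "released"]:
    state = apply_transition(state, event)
print(state)
```

Replaying the commit and then sending a late release leaves the reservation in committed, which matches the rule that both are ignored.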
4.4 Effective output cap
The hard-gate guarantee depends on who controls max_output_tokens. The authority estimates against effective_max_output_tokens, never blindly against the client-supplied value:
effective_max_output_tokens = min(
    client_requested_max_output_tokens,
    tenant_policy_max_output_tokens,
    model_context_remaining_output_limit,
    budget_derived_max_output_tokens   # opt-in only, see below
)
budget_derived_max_output_tokens is treated as ∞ unless tenant opt-in clamping is enabled; the min() never budget-clamps by default.
If the client omits max_output_tokens, enforce mode applies a tenant default or rejects the request. If the client requests a value above tenant policy, the authority clamps or rejects according to policy.
Budget-derived clamping is opt-in tenant policy, never default. Shrinking a generation to fit remaining budget is a behavior change disguised as accounting: a truncated diff or half-written JSON can be worse than a clean block. When enabled and applied, clamping must be visible (X-Budget-Output-Clamped: true, plus both values in the decision record and problem body).
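The cap rules in this section can be sketched as one function. It is a minimal illustration under assumptions: the function and parameter names are hypothetical, an over-policy request is clamped rather than rejected, and passing `budget_derived_max=None` stands for "tenant has not opted in", which is equivalent to treating that term as infinite.

```python
from typing import Optional, Tuple


def effective_max_output_tokens(
    client_requested: Optional[int],
    tenant_policy_max: int,
    model_remaining_output_limit: int,
    budget_derived_max: Optional[int] = None,  # None = clamping not enabled
    tenant_default: Optional[int] = None,
) -> Tuple[int, bool]:
    """Return (effective cap, whether budget clamping actually reduced it)."""
    if client_requested is None:
        if tenant_default is None:
            # No client value and no tenant default: reject the request.
            raise ValueError("max_output_tokens is required")
        client_requested = tenant_default

    unclamped = min(client_requested, tenant_policy_max, model_remaining_output_limit)
    if budget_derived_max is None:
        return unclamped, False

    capped = min(unclamped, budget_derived_max)
    return capped, capped < unclamped


print(effective_max_output_tokens(8192, 4096, 100_000))
print(effective_max_output_tokens(8192, 4096, 100_000, budget_derived_max=1000))
```

The boolean in the return value is what would drive `X-Budget-Output-Clamped: true`, so the header is emitted only when clamping changed the outcome.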
4.5 Enforcement modes
Modes are configured per tenant and scope, and each one answers "how wrong can the estimate be?" explicitly:
| Mode | Behavior |
| --- | --- |
| advisory_estimate | Log and emit headers only. Nothing blocked. |
| soft_gate | Block only if estimate exceeds remaining by a configured safety margin. |
| hard_gate | Atomic worst-case reservation required before forwarding. |
| actuals_only | Allow until committed spend reaches the limit, then block new calls. No estimation trust required. |
Mode and decision are different dimensions: mode is configuration, decision is the per-request outcome (see header set in 4.7). The recommended adoption path is staged: advisory → downgrade-permitted → blocking. Teams watch advisory numbers before enabling anything that can touch a production run.
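The split between mode (configuration) and decision (per-request outcome) can be sketched as a single function over one scope. This is an illustration under assumptions: `decide` is a hypothetical name, downgrade is left out, and returning advisory_warn when an advisory-mode estimate exceeds the remaining budget is this sketch's interpretation rather than something the text above specifies. In hard_gate the real check would run inside the atomic reservation.

```python
def decide(
    mode: str,
    estimate: int,      # micro-USD
    limit: int,         # micro-USD
    committed: int,     # micro-USD
    reserved: int = 0,  # micro-USD
    safety_margin: int = 0,
) -> str:
    """Map an enforcement mode and ledger state to a decision value."""
    remaining = limit - committed - reserved
    if mode == "advisory_estimate":
        # Never blocks; only signals.
        return "advisory_warn" if estimate > remaining else "allow"
    if mode == "soft_gate":
        # Blocks only when the estimate overshoots by more than the margin.
        return "block" if estimate > remaining + safety_margin else "allow"
    if mode == "hard_gate":
        # The worst-case estimate must fit entirely.
        return "block" if estimate > remaining else "allow"
    if mode == "actuals_only":
        # Ignores the estimate; looks at committed spend only.
        return "block" if committed >= limit else "allow"
    raise ValueError(f"unknown enforcement mode: {mode}")


for mode in ("advisory_estimate", "soft_gate", "hard_gate", "actuals_only"):
    print(mode, decide(mode, 310_000, 5_000_000, 4_910_000, safety_margin=100_000))
```

The same request (a $0.31 estimate against $0.09 remaining) produces a different outcome in each mode, which is why the headers report mode and decision separately.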
4.6 Ledger implementation invariants
Money precision: all ledger amounts are stored as integer micro-USD. API responses render decimal strings. Floating-point arithmetic is forbidden in reservation, commit, and reconciliation paths.
Redis Cluster: all ledger keys touched by one reservation script must share a hash tag (e.g. {tenant_id}:budget:scope:...) so the multi-scope Lua transaction stays single-slot.
SQLite fallback is single-node/dev mode only and is not a multi-instance production ledger.
Price versioning: every decision records provider, model, input price, output price, cache-read and cache-write prices, currency, and price_table_version. Tenant-specific price overrides are supported. Unknown price is unroutable unless an explicit tenant override exists.
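Three of these invariants are small enough to sketch directly: integer micro-USD conversion without floats, hash-tagged ledger keys, and fail-closed price lookup. All function names, the two-decimal rendering, and the `(input, output)` price tuple are assumptions made for illustration.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple

MICRO = 1_000_000
Price = Tuple[int, int]  # assumed shape: (input, output) in micro-USD per 1M tokens


class UnknownPriceError(LookupError):
    """Raised when a model has no known price and no tenant override."""


def usd_to_micro(amount: str) -> int:
    """Parse a decimal string into integer micro-USD without using floats."""
    scaled = Decimal(amount) * MICRO
    if scaled != scaled.to_integral_value():
        raise ValueError("amount is finer than one micro-USD")
    return int(scaled)


def micro_to_usd(micro: int) -> str:
    """Render integer micro-USD as a decimal string for API responses."""
    return f"{Decimal(micro) / MICRO:.2f}"


def ledger_key(tenant_id: str, scope_type: str, scope_id: str) -> str:
    # The braces form a Redis Cluster hash tag, so all of a tenant's keys share a slot.
    return f"{{{tenant_id}}}:budget:scope:{scope_type}:{scope_id}"


def resolve_price(
    model: str,
    price_table: Dict[str, Price],
    tenant_overrides: Optional[Dict[str, Price]] = None,
) -> Price:
    """Fail closed: a missing price is an error, never a zero."""
    if tenant_overrides and model in tenant_overrides:
        return tenant_overrides[model]
    if model in price_table:
        return price_table[model]
    raise UnknownPriceError(model)
```

Raising on a missing price, rather than defaulting to zero, is what keeps an unpriced model from being treated as free.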
4.7 Budget-state protocol
Successful provider responses are not body-mutated by default. The authority must not break OpenAI-compatible response contracts, SDKs, streaming clients, or eval harnesses. Headers expose the decision and the tightest applicable scope; full multi-scope state is available through logs, audit export, or a lookup endpoint:
GET /budget/decisions/{decision_id}
Decision records are retained for a configurable window (default 30 days) so the lookup promise is meaningful. An optional envelope mode may include full budget state in response bodies for clients that explicitly opt in.
Response headers:
X-Budget-Decision: allow | downgrade | advisory_warn | block
X-Budget-Decision-Id: bdgdec_01J...
X-Budget-Reservation-Id: rsv_01J... (when a reservation was made)
X-Budget-Enforcement-Mode: advisory_estimate | soft_gate | hard_gate | actuals_only
X-Budget-Blocking-Scope: run (present when constrained)
X-Budget-Remaining-USD: 2.86 (tightest applicable scope)
X-Budget-Requested-Model: claude-sonnet (on downgrade)
X-Budget-Selected-Model: claude-haiku (on downgrade)
X-Budget-Output-Clamped: true (only when clamping applied)
X-Budget-Price-Table-Version: 2026-07-04
X-Run-Id: run_abc (echoed or server-issued)
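On the agent side, these headers are what make adaptation possible. The sketch below shows one way a client loop might read them; `plan_next_step`, the action names, and both thresholds are hypothetical choices, not part of the protocol.

```python
from decimal import Decimal
from typing import Mapping


def plan_next_step(
    headers: Mapping[str, str],
    downshift_below_usd: str = "1.00",  # assumed threshold
    wrap_up_below_usd: str = "0.25",    # assumed threshold
) -> str:
    """Pick the agent's next move from budget-state response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}

    if normalized.get("x-budget-decision") == "block":
        return "stop"

    remaining_raw = normalized.get("x-budget-remaining-usd")
    if remaining_raw is None:
        return "continue"  # no budget signal on this response

    remaining = Decimal(remaining_raw)
    if remaining < Decimal(wrap_up_below_usd):
        return "wrap_up"    # finish the run while budget remains
    if remaining < Decimal(downshift_below_usd):
        return "downshift"  # use a cheaper model for routine steps
    return "continue"


print(plan_next_step({"X-Budget-Decision": "allow", "X-Budget-Remaining-USD": "2.86"}))
print(plan_next_step({"X-Budget-Decision": "allow", "X-Budget-Remaining-USD": "0.60"}))
```

Parsing the remaining amount with `Decimal` keeps the client consistent with the ledger's no-floating-point rule.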
Blocked requests return 402 Payment Required by default (status configurable, since RFC 9110 still reserves 402 for future use) with an RFC 9457 application/problem+json body:
{
  "type": "https://modelmuxer.dev/problems/budget-exceeded",
  "title": "Budget exceeded",
  "status": 402,
  "detail": "Estimated request cost exceeds the remaining run budget.",
  "code": "run_ceiling_reached",
  "budget": {
    "scope": "run",
    "run_id": "run_abc",
    "limit_usd": "5.00",
    "committed_usd": "4.91",
    "reserved_usd": "0.00",
    "remaining_usd": "0.09",
    "estimate_usd": "0.31",
    "effective_max_output_tokens": 4096,
    "client_requested_max_output_tokens": 8192,
    "price_table_version": "2026-07-04"
  },
  "alternatives": [
    {
      "model": "claude-haiku",
      "estimate_usd": "0.04"
[truncated for AI cost control]