We Get AI Costs Under Control
An exploration of FinOps for AI, focusing on token-based cost attribution, transparency, and control mechanisms such as AI proxies, limits, and guardrails to prevent cost explosions.
FinOps for AI: How We Get AI Costs Under Control - forwardnow GmbH
Why tokens are becoming the new unit of cost, and how transparency, clear limits, and empowered teams keep it under control
A new cost dimension with familiar patterns
One example caused a stir across the industry: an AI consultant told the news outlet Axios that one of their clients burned through roughly half a billion dollars in a single month because no usage limits had been set on the employees’ AI licenses. The case sounds extreme, and at that scale it is the exception. The underlying pattern is not. On a smaller scale, many organizations are experiencing it right now: a bill jumps from a few hundred to several thousand euros a month, without any alarm going off in the system or any way to tell which service or which user caused the increase.
Anyone familiar with FinOps from the cloud world recognizes the problem immediately. What drives the cost is not the individual expensive token but the lack of visibility and the absence of clear boundaries. This is exactly where FinOps comes in: it connects financial governance with operational transparency, so that business units, IT, and finance can make decisions based on the same data. The same logic applies to AI. Only the unit of cost has changed.
The good news up front: there is now an open standard for this transparency. Just as logs, metrics, and traces became established in classic monitoring, the OpenTelemetry GenAI Semantic Conventions provide a uniform vocabulary to capture AI consumption in a vendor-neutral way and attribute it to individual sources. This standard is the common thread running through the sections that follow, from attribution through architecture to ongoing controlling.
In short, FinOps for AI rests on five levers: first, transparency through token attribution, meaning the assignment of every model call to user, team, and feature; second, an AI proxy as a central control point for all AI traffic; third, clear limits and guardrails that prevent uncontrolled costs; fourth, continuous optimization through the right model choice, lean calls, and caching; and fifth, empowered teams that understand and own their costs. The following sections walk through these levers one by one.
Tokens are money: AI does not bill like classic software
Classic software is mostly billed per license or per seat. Costs are predictable and rarely change between two billing periods, and procurement runs through a clearly defined purchasing process. AI behaves differently. Here the unit of cost is the token, and costs arise anew with every single call, depending on the length of the input, the length of the response, and the chosen model.
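As a minimal sketch of this arithmetic, the cost of a single call can be derived from its token counts and a per-model price table. The model names and prices below are hypothetical placeholders, not real list prices, and we assume that input and output tokens are priced separately per million tokens.

```python
from decimal import Decimal

# Hypothetical prices in EUR per one million tokens (placeholders, not real list prices).
PRICES_EUR_PER_MTOK = {
    "small-model": {"input": Decimal("0.50"), "output": Decimal("2.00")},
    "large-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def call_cost_eur(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the cost of one model call in EUR."""
    price = PRICES_EUR_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / Decimal(1_000_000)


# The same request costs a different amount depending on the chosen model.
print(call_cost_eur("small-model", 2_000, 500))
print(call_cost_eur("large-model", 2_000, 500))
```

Under these assumed prices, the identical request is several times more expensive on the larger model, which is why model choice is itself a cost-relevant decision.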
A parallel to the cloud is decisive here, and many underestimate it: the purchasing decision is democratized within the organization. Just as the cloud shifted the procurement of infrastructure out of purchasing and into the hands of engineering, AI distributes spending authority even more finely. Costs are no longer decided by a central body; instead, every developer triggers real spending with every prompt, every model choice, and every agent they start. Cost-relevant decisions are therefore made orders of magnitude more often than with classic software.
The effect is especially pronounced with agentic workflows, that is, AI systems that handle multi-step tasks autonomously. Such processes consume many times the tokens of a single model call because they work in loops, carry their context along from step to step, and generate intermediate results. A single careless loop or an unbounded background job can therefore cause significant costs in a short time. Whereas a cloud misconfiguration often takes days to show its full effect, an agent running out of control can become costly within minutes.
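A rough back-of-the-envelope model illustrates why loops are so expensive. The sketch below assumes, as a simplification, that every step of an agent resends the complete accumulated context as input and that each step appends a fixed number of tokens to it. The function name and all numbers are illustrative assumptions, not measurements.

```python
def agent_input_tokens(steps: int, base_context: int, tokens_added_per_step: int) -> int:
    """Total input tokens when every step resends the whole accumulated context."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                   # the full context is sent as input again
        context += tokens_added_per_step   # step output and tool results are appended
    return total


single_call = agent_input_tokens(steps=1, base_context=2_000, tokens_added_per_step=1_500)
agent_run = agent_input_tokens(steps=20, base_context=2_000, tokens_added_per_step=1_500)
print(single_call, agent_run, agent_run / single_call)
```

Because the context grows with each step, the total input grows roughly with the square of the number of steps rather than linearly. Provider-side prompt caching, where available, can change this picture and is not modeled here.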
This shifts the central question. It is no longer whether AI should be used, but how its consumption can be made visible, attributed to individual sources, and limited when necessary.
From cloud tagging to token tagging
The most important idea for decision-makers is this: AI cost control is not an entirely new problem. It is FinOps with a finer granularity and a much higher velocity. Anyone who has already established a tagging strategy in the cloud already holds the decisive mindset. It only needs to be transferred to the new unit of cost.
In cloud FinOps, we attribute every resource via tags such as cost center, team, environment, or project. AI needs exactly the same discipline, only now the tags are attached to every model call: user, team, feature, or workflow. Without this attribution at the moment of the call, the provider’s aggregated bill cannot be broken down later. Anomalies cannot be explained, and the business value of an individual feature cannot be calculated.
Three parallels are particularly insightful here. First, the biggest practical problem in both worlds is coverage. In every cloud project, significant portions of resources are initially untagged and end up in an unallocated bucket. With AI the same question arises: does every single call really carry an identity? Second, in both cases the lever lies in enforcement at the source. In the cloud this means an untagged resource violates a policy. With AI it means a call without attribution is not let through in the first place. Third, the real strategic advantage emerges when the same taxonomy runs across both worlds. Then, for the first time, it becomes possible to answer what a feature costs in total, infrastructure and AI together. This is exactly the common denominator pursued by the FinOps Open Cost and Usage Specification, FOCUS for short, which increasingly brings AI consumption data into a uniform format and thus forms the natural connection to existing tools such as Apptio.
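The first two parallels, coverage and enforcement at the source, can be sketched in a few lines. The tag names in REQUIRED_TAGS and both helper functions are assumptions made for illustration; a real taxonomy would follow the organization's existing cloud tagging scheme.

```python
REQUIRED_TAGS = ("user", "team", "feature")  # assumed taxonomy, adapt to your own


class MissingAttributionError(Exception):
    """Raised when a call arrives without complete attribution."""


def enforce_attribution(tags: dict[str, str]) -> dict[str, str]:
    """Enforcement at the source: reject a call that lacks any required tag."""
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        raise MissingAttributionError(f"call rejected, missing tags: {missing}")
    return tags


def tag_coverage(calls: list[dict[str, str]]) -> float:
    """Share of calls that carry every required tag (1.0 means full coverage)."""
    if not calls:
        return 1.0
    tagged = sum(1 for call in calls if all(call.get(name) for name in REQUIRED_TAGS))
    return tagged / len(calls)
```

The coverage figure is the AI counterpart of the unallocated bucket in cloud cost reports: everything below 1.0 is spend that cannot be assigned to anyone.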
How these building blocks fit together is shown by the following architecture overview, before we look at the individual parts in more detail.
Figure: Building-block view of the AI proxy — consumers on the left, the proxy in the center with its modules for identity, usage metering, limits, routing, and telemetry export, the providers including on-prem on the right, and Dash0 below as the analysis system
Visibility first: a standard for AI telemetry
Before you can optimize, you have to measure. As mentioned at the outset, an open standard now exists for this: the OpenTelemetry GenAI Semantic Conventions. Concretely, they define a uniform vocabulary for AI telemetry, for example for the model used, the provider, the type of operation, and the consumption of input and output tokens. The advantage for organizations is considerable. Once you instrument against this standard, you are not tied to a single analysis vendor but can send the data to whichever system is to analyze it.
To attribute costs to individual users, you add your own business attributes to these standardized fields, that is, user identifier, team, feature, or cost center. These are exactly the attributes on which costs are aggregated later. It is the same idea as the trace ID or correlation ID in classic logging, with which an entry can be unambiguously assigned to a request or business process. Only here the identifier serves economic attribution rather than troubleshooting.
In practice, this looks like wrapping every model call in a span that carries the standardized fields and your own attribution attributes:
from opentelemetry import trace

tracer = trace.get_tracer("ai-proxy")

with tracer.start_as_current_span("chat") as span:
    # Standardized GenAI attributes (OpenTelemetry Semantic Conventions)
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.provider.name", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-sonnet-4-6")

    # Your own attribution attributes: the basis of every cost breakdown
    span.set_attribute("enduser.id", "k.herings")
    span.set_attribute("team.name", "customer-support")
    span.set_attribute("feature.name", "support-rag")
    span.set_attribute("cost_center", "CC-4711")

    response = client.messages.create(...)

    # Record consumption after the call
    usage = response.usage
    span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
The gen_ai.* fields follow the open standard and are therefore identical across any compatible analysis system. Fields such as enduser.id or team.name are the business-level addition along which the bill can later be broken down by user, team, or feature. What matters is that this attribution is set at the time of the call, because it cannot be reconstructed afterward from the provider’s aggregated bill.
An important note from a data protection perspective: pure cost monitoring does not require storing prompt content. The metadata is sufficient, that is, token counts, costs, model, latency, and the attribution attributes. Especially for organizations with high requirements for data protection and data residency, this separation is decisive.
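To make both points concrete, the following sketch shows a usage record that holds only metadata, together with an aggregation along any attribution attribute. The UsageRecord type, its field names, and the cost_by helper are hypothetical and serve only as illustration; in practice the analysis system would perform this aggregation on the exported telemetry.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """Metadata of one model call; deliberately has no field for prompt or response text."""
    user: str
    team: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_eur: float
    latency_ms: float


def cost_by(records: list[UsageRecord], attribute: str) -> dict[str, float]:
    """Sum the cost of all records per value of one attribution attribute."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, attribute)] += record.cost_eur
    return dict(totals)


records = [
    UsageRecord("k.herings", "customer-support", "support-rag", "large-model", 2000, 500, 0.5, 820.0),
    UsageRecord("a.example", "customer-support", "support-rag", "small-model", 900, 200, 0.25, 310.0),
    UsageRecord("b.example", "sales", "lead-scoring", "large-model", 4000, 800, 1.0, 1250.0),
]
print(cost_by(records, "team"))
print(cost_by(records, "feature"))
```

The same records answer the question per user, per team, or per feature, without any prompt content ever being stored.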
A central control point: the AI proxy
The most practical architecture for bringing visibility and control together is a single central point through which all AI traffic passes, often referred to as an AI gateway or AI proxy. Instead of each application talking directly to the providers, the traffic runs through this single point. Applications and tools such as development environments, chat interfaces, agentic pipelines, or in-house services do not receive real provider keys, but virtual keys that are mapped internally to user, team, or cost center. Every call is thus automatically attributed, without the individual developer having to do anything extra. The building-block view shown earlier makes clear how measurement and enforcement come together in this one component.
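The mapping from virtual key to identity can be pictured as a simple lookup. The registry below, the key format, and the resolve function are assumptions for illustration only; an actual gateway would keep this mapping in its own key store and would also swap in the real provider credential before forwarding the call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    user: str
    team: str
    cost_center: str


# Hypothetical registry: applications only ever see the virtual key on the left.
VIRTUAL_KEYS = {
    "vk-support-rag-001": Identity("k.herings", "customer-support", "CC-4711"),
}


def resolve(virtual_key: str) -> Identity:
    """Map a virtual key to its owner; unknown keys are refused."""
    try:
        return VIRTUAL_KEYS[virtual_key]
    except KeyError:
        raise PermissionError("unknown virtual key") from None


identity = resolve("vk-support-rag-001")
print(identity.team, identity.cost_center)
```

Because the identity comes from the key rather than from the request body, the attribution cannot be forgotten or mistyped by the calling application.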
The decisive point is that measurement and enforcement sit in the same building block. The proxy captures consumption, model, cost, and latency, enforces limits, can route simple tasks to smaller models, and exports the telemetry to an analysis system following the open standard. With this, transparency is no longer a downstream analysis at the end of the month but part of every single call. A welcome side effect is that this path also brings previously uncontrolled direct usage, so-called shadow AI, into a governed environment.
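Routing simple tasks to smaller models can start as a plain rule. The task categories, the token threshold, and the model names in this sketch are hypothetical; which tasks count as simple has to be determined per use case, ideally by measuring output quality.

```python
SIMPLE_TASKS = {"classification", "extraction", "summarization"}  # assumed categories
MAX_SIMPLE_PROMPT_TOKENS = 4_000                                  # assumed threshold


def choose_model(task_type: str, prompt_tokens: int) -> str:
    """Send short, simple tasks to the cheaper model and everything else to the larger one."""
    if task_type in SIMPLE_TASKS and prompt_tokens <= MAX_SIMPLE_PROMPT_TOKENS:
        return "small-model"
    return "large-model"


print(choose_model("classification", 500))
print(choose_model("code-generation", 500))
```

Keeping the rule in the proxy rather than in each application means it can be tuned centrally as prices and models change.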
Figure: Health of the AI proxy in Dash0 — throughput, latency, and error rate per service as the basis for anomaly detection
Setting limits and guardrails
Visibility alone does not prevent a cost explosion. It is the prerequisite for being able to define sensible boundaries in the first place. In practice, a clear maturity pattern has emerged. Teams first introduce limits on the number of requests, add limits on token consumption after the first surprise bill, and introduce a hard budget limit per period and team after the second.
These limits work together on several levels. Request-count limits protect the infrastructure. Token-based limits steer the actual consumption, since tokens correlate directly with compute effort and cost. Budget limits finally prevent unexpected load spikes from batch processing or agent loops from leading to untenable bills. At the gateway, this can be declared per virtual key:
# Limits per virtual key (example: Customer Support team)
key: support-rag
limits:
  requests_per_minute: 120             # protects the infrastructure
  tokens_per_minute: 200000            # steers the actual consumption
  budget:
    period: monthly
    soft_limit_eur: 5000               # soft threshold → alert
    hard_limit_eur: 6000               # hard limit → calls are rejected
  circuit_breaker:
    cost_velocity_eur_per_min: 20      # stops runaway agents within minutes
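How a gateway might evaluate the budget and circuit-breaker settings is sketched below. Both the budget_decision function and the CostVelocityBreaker class are illustrative assumptions, not the implementation of any particular proxy; timestamps are passed in explicitly so that the logic stays easy to test.

```python
from collections import deque


def budget_decision(spent_eur: float, call_cost_eur: float,
                    soft_limit_eur: float, hard_limit_eur: float) -> str:
    """Return 'allow', 'alert' (soft limit exceeded), or 'reject' (hard limit exceeded)."""
    projected = spent_eur + call_cost_eur
    if projected > hard_limit_eur:
        return "reject"
    if projected > soft_limit_eur:
        return "alert"
    return "allow"


class CostVelocityBreaker:
    """Trips when the spend inside a sliding time window exceeds a threshold."""

    def __init__(self, max_eur_per_window: float, window_s: float = 60.0) -> None:
        self.max_eur_per_window = max_eur_per_window
        self.window_s = window_s
        self.events: deque[tuple[float, float]] = deque()  # (timestamp, cost)
        self.window_cost = 0.0

    def record(self, timestamp_s: float, cost_eur: float) -> bool:
        """Add one call's cost; return True if further calls should be stopped."""
        self.events.append((timestamp_s, cost_eur))
        self.window_cost += cost_eur
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0][0] <= timestamp_s - self.window_s:
            _, old_cost = self.events.popleft()
            self.window_cost -= old_cost
        return self.window_cost > self.max_eur_per_window


breaker = CostVelocityBreaker(max_eur_per_window=20)
print(budget_decision(4_995, 10, soft_limit_eur=5_000, hard_limit_eur=6_000))
print([breaker.record(t, 8.0) for t in (0, 10, 20)])
```

The budget check guards the monthly total, while the breaker reacts to the rate of spending, which is what catches a runaway agent long before the monthly limit is reached.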
Particul
[truncated for AI cost control]