
AI Has a Measurement Problem – And it's everyone's problem

AI tools have been rapidly adopted in tech companies, but measuring their actual value remains a challenge. The article points out that many companies blindly spend huge sums on AI without connecting spending to output, leading to waste and blind cuts. The author proposes an attribution-based measurement method based on his own experience to link AI spending to work results.

Source: Hacker News AI. Author: gallardo147

Luis Gardea

Jun 08, 2026

This is my first blog post, so quick intro: I’m a software engineer working on growth and experimentation at Instacart. I’ve been building with AI outside of work and I’m intellectually hooked on this space. I expect to write about AI, tech, and finance, but I’ll go wherever interesting problems are. Opinions are my own. With that out of the way, here’s what I want to talk about.

Tokenmaxxing – the symptom

AI tools have been rapidly adopted at tech companies over the last couple of years. Uber has been in the news lately, with their COO saying that he can’t draw a line from the rising usage of Claude Code and token spend to useful shipped features. They’ve been spending $500-$2000 per month per engineer, with adoption partially driven by internal leaderboard dashboards ranking teams or engineers by token usage. Reports say they blew through their 2026 AI budget in 4 months.

Tokenmaxxing has been a live trend at other companies too. Salesforce set minimum daily spend targets and built dashboards showing what employees were spending. Meta’s Claudeonomics leaderboard was created internally by an engineer and has since been killed. Amazon had a similar leaderboard that was pulled after gaming concerns. Jensen Huang said he’d be “deeply alarmed” if a $500k engineer wasn’t burning $250k in tokens per year. And a company (obviously a very large one) reportedly spent $500m on tokens in a single month.

EntelligenceAI analyzed over a million pull requests across 2,400+ engineering organizations and found that only 18% of AI coding spend mapped to shipped products that reached real users. The rest went to reactive work, rework, and review friction.

So it’s not just that employees are gaming metrics. There’s a measurement vacuum and the gaming is a symptom.

The correction is already underway: Uber and others are starting to ration access, cap spend, and pull tools. But the impact of this is also unmeasured (more rational than limitless spend, but still unmeasured). Companies spent blindly and now are cutting blindly. There’s an inability to connect spend to value, and this recurs at every scale. Closing the measurement gap, by connecting spend to impact where possible and honestly bracketing the uncertainty where it isn’t, is the leverage.

Why now?

In 2026, execution is cheap. When execution was the bottleneck, scarcity did your prioritization for you. AI removed the bottleneck, and now the prioritization question is exposed.

Cheap to try means more things tried at a higher false-positive rate. Whether the new exploration rate is net positive is itself unanswerable without measurement. Even Anthropic reports this internally: “explosion of new ideas, initiatives, tools, and simulations – far more than we have the capacity to pursue”, with human code review becoming the new bottleneck once generation scaled. Even the most advanced labs are hitting the same wall: with execution getting cheap, the constraint has shifted to deciding what’s worth doing.

Quality evidence is mixed, and the mix is the point. Analysis from Georgia Tech found AI-generated CVEs tripled between Q4 ’25 and Q1 ’26; Waydev found more accepted code with more rework. But Jellyfish found no statistically significant relationship between AI adoption and bug or revert rates, and METR’s RCT found experienced devs were 19% slower with AI while perceiving a speedup. The evidence is conflicting because nobody measures it cleanly.

The obvious half-fix, measuring tasks and not tokens, just moves the gap up. You learn how many PRs were merged, maybe which tokens went to which PR, but not whether the PR mattered. So you learn if more work is getting done, but not the impact it had.

The same gap, three layers up

Without measurement, every adoption decision is driven by what peers are doing rather than demonstrated returns. It’s FOMO, wanting what high-status others want, applied to AI spend.

Inside a tech company, this shows up as tokenmaxxing. Boards pressure management to show AI adoption because peer companies show AI adoption, and management translates that into spend targets (or AI-generated PR percentage). Those targets turn into leaderboards, and employees optimize the only visible metric. The whole chain is running on FOMO at every step: Goodhart’s Law meeting Girardian desire. Goodhart’s Law, “When a measure becomes a target, it ceases to be a good measure,” shows up everywhere from A/B testing to reward hacking in AI/ML. Tokenmaxxing is that same failure applied to engineering productivity.

Applied to AI lab revenue: labs capture the token revenue but it’s impossible for them to separate durable from performative demand. The measurement gap might flatter their quarter but it undermines their planning in the same motion. If even 15-20% of enterprise usage is performative and corrects, the lab’s revenue drops in a way they couldn’t have modeled.

In 2026, $725b is expected in capex spend, up 77% from 2025, with 75% of that being AI-specific. This is not pulled by enterprise tokens alone; it’s justified by own-product inference, backlog contracts, training, and supply constraint signals. Performative tokens don’t inflate capex dollar for dollar. The connection is subtler: if early adoption curves are partly performative, the growth trajectory looks steeper than real demand warrants, and capacity planning extrapolates from that slope.
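The arithmetic behind those figures, and the slope effect described above, can be sketched in a few lines of Python. The first half uses only the numbers quoted in the text. The second half is a toy illustration: the usage index values for `durable` and `performative` are hypothetical and chosen only to show the mechanism, not taken from any real data.

```python
# Figures quoted in the text, in $ billions.
CAPEX_2026_B = 725.0
YOY_GROWTH = 0.77     # "up 77% from 2025"
AI_SHARE = 0.75       # "75% of that being AI-specific"

implied_2025_b = CAPEX_2026_B / (1 + YOY_GROWTH)
ai_specific_2026_b = CAPEX_2026_B * AI_SHARE


def extrapolate(previous: float, current: float) -> float:
    """Project the next period by assuming the last growth rate repeats."""
    return current * (current / previous)


# Hypothetical usage index for two periods (NOT real data).
durable = [100.0, 140.0]        # demand that survives a correction
performative = [0.0, 35.0]      # leaderboard-driven usage layered on top
observed = [d + p for d, p in zip(durable, performative)]

planned_on_observed = extrapolate(*observed)   # what the visible slope implies
planned_on_durable = extrapolate(*durable)     # what durable demand implies
overshoot = planned_on_observed / planned_on_durable - 1

print(f"implied 2025 capex:  ${implied_2025_b:.1f}b")
print(f"AI-specific in 2026: ${ai_specific_2026_b:.2f}b")
print(f"capacity overshoot in the toy scenario: {overshoot:.1%}")
```

The quoted figures imply roughly $410b of capex in 2025 and about $544b of AI-specific spend in 2026. In the toy scenario, only 20% of observed usage in the second period is performative, yet planning from the observed slope overshoots the durable trajectory by much more than 20%, because the error compounds through the growth rate.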

Demand is real, the signal is just corrupted. And nobody, bull or bear, can currently tell by how much. At this scale, a correction driven by the measurement gap isn’t just a tech correction, it’s macroeconomic. There is local leverage though: you can’t really fix $725b of capex spend, but you can instrument your own org and not be the one flying blind.

What we can measure today

I want to be clear that I’m not talking about measuring capability. Nobody serious is disputing that AI can do engineering work and benchmarks exist to answer the capability question. I’ll be focusing on engineering spend, where the most expensive tools like Claude Code and Codex live, the artifacts generated from them are the most concrete (PRs, deployments, experiments), and the attribution chain is the most tractable.

Tools like Claude Code and Codex already expose token and spend telemetry; session data can be piped into observability tools like Datadog (hence the token leaderboards). PR counts and LOC exist as proxy metrics, though everyone knows they’re flawed metrics. And there are real qualitative gains that are hard to put a number on. One example is agentic reviews that can encode engineering standards into CLAUDE.md/AGENTS.md files at each codebase level, shifting review from cultural enforcement (humans enforcing standards during design or PR reviews) to structural enforcement (agents applying those standards automatically during every review). Nobody who uses these tools doubts that AI is adding value somewhere.

And measurement works when you scope it. Anthropic just released a report on their progress and productivity (and the possibility of getting closer to recursive self-improvement, or RSI); they’re reporting 8x lines of code per day vs pre-2025. But they themselves are honest that the 8x code output is “almost certainly an overstatement of the true productivity gain”. When they scope a specific metric upfront, they can draw the line from capability to impact: 800 fixes that reduced a class of API errors by 1000x, training code optimizations from 3x to 52x speedup on a defined benchmark, automated review catching a third of production bugs before merge. So scoped measurement works; the question is whether it can scale beyond individual projects to an organization-wide view.

Jellyfish commercialized this problem. Their AI Impact product correlates Claude Code telemetry with proxy metrics like PR throughput, cycle time, and DORA metrics. This is evidence that the gap is real. But they have also acknowledged that they hit a wall: their attribution is correlational and non-causal by their own admission. It doesn’t show whether AI tools themselves caused the improvements or whether the developers pushing them were already high performers.

Most companies of any scale already have impact measurement systems like experimentation platforms, DORA metrics, and business outcome tracking. So both halves of the infrastructure exist, and what’s missing is the link connecting token spend to the unit of work that is measured by those impact systems. Attribution is the missing join key.

So, what’s buildable?

By attribution, I’m not talking about activity tracking. Activity tracking counts things like tokens consumed and PRs merged, which are essentially dead-end metrics. Attribution links the spend to a work unit that can then be connected to an outcome. You can’t connect spend to value without first connecting spend to a unit of work. It’s a foreign key, not a counter. That’s the floor.
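To make the counter versus foreign key distinction concrete, here is a minimal Python sketch. The `Session` record, the `ticket_id` field, the ticket ids, and the dollar amounts are all hypothetical names and values invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Session:
    """One agent session. All fields are hypothetical, for illustration."""
    session_id: str
    ticket_id: Optional[str]  # the work-unit key; None means unattributed
    cost_usd: float


def total_spend(sessions: List[Session]) -> float:
    """Activity tracking: one number, with nothing left to join on."""
    return sum(s.cost_usd for s in sessions)


def spend_by_ticket(sessions: List[Session]) -> Dict[str, float]:
    """Attribution: spend keyed by the unit of work it was spent on."""
    totals: Dict[str, float] = defaultdict(float)
    for s in sessions:
        totals[s.ticket_id or "UNATTRIBUTED"] += s.cost_usd
    return dict(totals)


sessions = [
    Session("s1", "ENG-101", 4.50),
    Session("s2", "ENG-101", 1.25),
    Session("s3", "ENG-202", 3.00),
    Session("s4", None, 2.00),
]

print(total_spend(sessions))
print(spend_by_ticket(sessions))
```

The counter can only go up or down. The keyed version can be joined to whatever system already records what happened to `ENG-101`, and it also makes the unattributed remainder visible instead of hiding it inside the total.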

A few months back, I hit a wall with Claude Code. I was running a ton of Claude Code sessions at once and I really started to feel some friction and hit the limits of my own executive function. It was just hard to keep track of everything at once, like decisions made in previous sessions or keeping track of the work being done in all of the live sessions. I built an outer harness to help me keep track of things, add determinism to these non-deterministic agentic workflows, externalize and persist my decisions, and cleanly manage context with role-scoped agents.

The architectural choice that turned out to be significant was routing all work through Linear as a coordination layer, so every prompt, plan and review is a persistent record tied to a ticket, with human gates in between. This made sense to me because humans use these tools to keep track of work and to break down larger tasks into smaller atomic units. I wasn’t trying to solve the measurement gap. I was trying to add structure so that I could run more sessions at once, so I used a proven system that provides auth, workflow management, and persistence (and that let me do work while I was away from my computer more reliably than Claude Code’s dispatch or remote control tools, at least at the time). But I captured the telemetry and persisted it to a local DB with tables joining jobs to tickets, simply because I wanted to keep track of my spend and I knew measurement was important. I only realized later that this provided a structural and deterministic link from multiple sessions to the work that was done.
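A local store of this kind can be very small. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database. It is not the harness’s actual schema: the table names, column names, roles, and sample rows are assumptions made up to show the shape of a jobs-to-tickets join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

# Hypothetical schema: every job must point at an existing ticket.
conn.executescript("""
CREATE TABLE tickets (
    ticket_id TEXT PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE jobs (
    job_id        TEXT PRIMARY KEY,
    ticket_id     TEXT NOT NULL REFERENCES tickets(ticket_id),
    role          TEXT NOT NULL,      -- e.g. planner, implementer, reviewer
    input_tokens  INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cost_usd      REAL NOT NULL
);
""")

conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [("ENG-101", "Checkout copy test"), ("ENG-202", "Flaky test cleanup")],
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("j1", "ENG-101", "planner", 1000, 200, 0.50),
        ("j2", "ENG-101", "implementer", 5000, 1500, 2.25),
        ("j3", "ENG-202", "reviewer", 2000, 300, 0.75),
    ],
)

# Spend per ticket, aggregated across however many sessions touched it.
rows = conn.execute("""
    SELECT t.ticket_id,
           COUNT(j.job_id)                        AS jobs,
           SUM(j.input_tokens + j.output_tokens)  AS tokens,
           ROUND(SUM(j.cost_usd), 2)              AS cost_usd
    FROM tickets AS t
    JOIN jobs AS j ON j.ticket_id = t.ticket_id
    GROUP BY t.ticket_id
    ORDER BY t.ticket_id
""").fetchall()

for row in rows:
    print(row)
```

The `NOT NULL REFERENCES` constraint is what makes the link structural rather than inferred: with foreign keys enabled, a job that names a missing ticket is rejected at write time, so there is no unattributed spend to guess about later.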

When I learned about what Jellyfish was doing, I recognized some similarities, but they were measuring at the analytics layer, inferring from correlation. My harness was measuring at the orchestration layer, exactly where the token spend comes from. And deterministic attribution helps close the measurement gap, because experiments need a clean unit-to-treatment link. You can’t trust your measurements if you only have a probabilistic guess at which spend produced which change.

And AI actually lowers the cost to build this attribution. Agent sessions are loggable units in a way that human effort never was (at least in software, unlike fields like law or consulting which rely on billable hours); the tools emit rich telemetry data by default.

Attribution is not enough; you also need a measurability taxonomy. Attribution tells you where the tokens went, but the taxonomy tells you how to evaluate whether they mattered. I don’t claim to have the perfect taxonomy, but we can roughly break work into:

Experimentable work like feature work, UI or algorithmic changes. The gap is the most closable here; it’s mostly a wiring problem plus a few tweaks. Companies already have A/B testing infrastructure and data scientists. The work would be to add structural enforcement, like requiring a ticket link when creating the experiment. Adding one field to an existing form and, from there, reading experiment results against token spend is tractable.
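The wiring for experimentable work can be sketched as follows. This is a minimal illustration, not any particular experimentation platform’s API: the `Experiment` record, the `TICKET_PATTERN` id format, the field names, and the sample values are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Assumed ticket-id shape such as "ENG-101"; real trackers may differ.
TICKET_PATTERN = re.compile(r"^[A-Z]+-\d+$")


@dataclass(frozen=True)
class Experiment:
    """Hypothetical experiment record with one extra required field."""
    name: str
    ticket_id: str          # the structural link back to the work unit
    relative_lift: float    # as reported by the existing A/B platform
    significant: bool

    def __post_init__(self) -> None:
        # Structural enforcement: refuse to create an unlinked experiment.
        if not TICKET_PATTERN.match(self.ticket_id or ""):
            raise ValueError(f"experiment {self.name!r} needs a valid ticket link")


def spend_vs_outcome(
    experiments: List[Experiment], spend_usd_by_ticket: Dict[str, float]
) -> List[dict]:
    """Read experiment results against the token spend on the same ticket."""
    return [
        {
            "experiment": e.name,
            "ticket": e.ticket_id,
            "spend_usd": spend_usd_by_ticket.get(e.ticket_id, 0.0),
            "relative_lift": e.relative_lift,
            "significant": e.significant,
        }
        for e in experiments
    ]


experiments = [
    Experiment("new-checkout-copy", "ENG-101", 0.012, True),
    Experiment("reranker-v2", "ENG-202", -0.003, False),
]
report = spend_vs_outcome(experiments, {"ENG-101": 5.75, "ENG-202": 3.00})

for line in report:
    print(line)
```

The validation in `__post_init__` stands in for the one extra form field: once the link is mandatory at creation time, the join in `spend_vs_outcome` needs no inference. How to weigh spend against lift is left open here, since that is a judgment for whoever owns the experiment.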

Sequenceable

[truncated for AI cost control]