Moneyball for Physical AI

This article applies the 'Moneyball' concept—using data-driven statistical analysis to find undervalued assets—to the field of Physical AI. It argues that robot data is currently mispriced, with overemphasis on volume and teleoperation hours rather than data novelty and marginal utility. By analyzing scaling laws and the economics of data collection, it proposes that capital efficiency in Physical AI depends on accurately computing and pricing data novelty, not maximizing data volume.

Source: Hacker News AI. Author: gmays

Animesh Garg

Jun 25, 2026

In 2002, the Oakland Athletics won 103 games despite maintaining the third-lowest payroll in Major League Baseball. This advantage emerged because the market for player assets was mispriced: legacy scouts favored subjective aesthetics, stolen bases, and batting averages, whereas forward-looking management mathematically isolated on-base percentage, the statistic that actually correlated with runs.

Finding the signal with the correct statistic in a field full of intuitive pundits: Moneyball!

Data for Physical AI is misunderstood and mispriced.

Data doesn’t exist for Physical AI. Data has an inherent cost of creation. We need to move beyond naively scaling data in hours or tokens.

Being scale-pilled often amounts to “believe in data”. However, unlike text, robot data isn't available to be mined. Every useful hour is paid for, so cost scales linearly with collection and does not fall with volume. Recently, Ken Goldberg estimated that frontier robotics models might require approximately 100,000 years of data.

The AGI revolution will not be supervised with sweatshop teleop.

To bypass this bottleneck, the industry has scaled manual teleoperation infrastructure. However, optimizing for cumulative operational hours replicates the “batting average” fallacy of early baseball: it prioritizes a visible, easily fundable metric that correlates weakly with actual downstream model performance. An alternative strategy proposes deploying robots into production to harvest telemetry as a zero-cost byproduct of operational revenue. This model introduces a subtler version of the same statistical error. The niches where deployment is possible today are the ones with the least variance, and they yield low-entropy, correlated data streams with minimal marginal utility.

This essay builds a framework for the marginal utility of data, and uses it to discuss value accrual in Physical AI. We take the perspective of the scaling laws that guide how loss behaves with data, and the unit economics that govern what a dollar of data is worth. Together they give an approximate marginal utility per dollar, the on-base percentage of physical AI.

Capital efficiency scales not by maximizing data volume, but by accurately computing and pricing data novelty. If you’d rather skip to conclusions, jump to recommendations.

  1. Stakeholder Biases in the Data Supply Chain

Varied stakeholders have differing views on data. Conveniently, each worldview happens to make their slice the most valuable.

Foundation-model labs sell generalized model scale and, as a result, overweight the role of large-scale pretraining, operating under the assumption that raw compute scaling will eventually eliminate edge-case errors. Teleoperation vendors are infrastructure utilities that prioritize and monetize raw operational hours, since their revenue scales with data volume rather than utility or novelty. Hardware incumbents operate on the assumption of environmental stationarity, since their solutions fail out-of-distribution. And a large camp of academic roboticists denies it is a data problem at all and expects physics, models, and control to close the gap without the deluge.

The key archetype to analyze is the neo-integrator. This model attempts to bypass data-collection bottlenecks by deploying specialized robotic units into commercial production, utilizing human-in-the-loop oversight to manage operational failures. The core thesis relies on an unproven economic flywheel: production telemetry will generate the novelty required to train multi-task capabilities. Evan Beard of Standard Bots makes the case at length. Kyle Vedder pushes back on the deployment-first approach, arguing that the environments willing to pay for early-stage deployment are naturally low-variance, creating a "novelty pump" constraint.

We analyze this debate through a neutral framework combining empirical scaling laws and the unit economics of data capture, isolating exactly which allocation strategy yields the highest model capability per dollar.

  2. Taxonomy of Robot Data

Data operations in physical AI map across three modalities, each defined by trade-offs between cost and information density:

Observational Data: Low-cost, high-breadth, action-deficient corpora (e.g., egocentric and exocentric video). This modality expands support of the representation, but lacks direct action supervision.

Interventional Data: High-cost, low-breadth, action-dense demonstrations (e.g., teleoperation). This modality maps explicit state-action trajectories but scales linearly with human labor.

Deployment Data: Endogenous telemetry generated by production systems, often running at a loss. This modality is un-curated and samples an environmental distribution dictated by commercial operations rather than algorithmic design.

Data maximization often introduces low-entropy noise that degrades training efficiency. As demonstrated by the C4 dataset in language modeling, removing subsets of the data can improve the model, notably by filtering boilerplate and near-duplicates to maximize distinct token coverage within a fixed budget.

As stakeholders, the questions we have to ask are these. What does a dollar buy in each type of data? Where does new information come from? And can deployment, the data we are paid to collect, widen the set of tasks we can deploy, or does it run dry quickly?

Evaluating a data pipeline is a capital-allocation problem: balancing the marginal cost of data against the novel information it carries and its ability to advance the model’s generalizability.

  3. What do the scaling laws tell us?

The scaling-law literature answers these questions for language models. What matters about a dataset goes beyond its size: how many distinct examples it holds, how diverse the mixture is, how often each example repeats, and how close new data is to existing data.

3.1 Does more data help?

Yes, as a power law with diminishing returns, down to a floor. Test loss falls as a straight line on a log-log plot against data, model size, and compute (Kaplan 2020). With model size N and tokens D, the joint scaling formulation (Hoffmann 2022) models loss as:

\(L(N,D)=E+A N^{-\alpha}+B D^{-\beta} \)

Straight power law against compute & data on log scale (Kaplan et al 2020)

The functional form is consistent, while the numerical values remain approximations (Besiroglu 2024). At the compute-optimal allocation the two reducible terms decay at the data rate and collapse to a one-dimensional envelope,

\(L^{*}(D)=E+\tilde B D^{-\beta} \)

The constant E represents the model's irreducible predictive uncertainty.
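The diminishing returns in the parametric form above can be made concrete with a minimal sketch. The constants are the fitted values commonly quoted from Hoffmann et al. (2022) and are used here only for illustration, since the text notes the exact numbers are approximate; the model size of one billion parameters and the function names are assumptions of this example.

```python
# Illustrative constants: fitted values commonly quoted from Hoffmann et al. (2022).
# Treat them as approximate; the text notes the exact numbers are disputed.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A * N^-alpha + B * D^-beta."""
    return E + A * n_params ** -ALPHA + B * n_tokens ** -BETA


def gain_from_doubling_data(n_params: float, n_tokens: float) -> float:
    """Loss reduction obtained by doubling D while holding N fixed."""
    return loss(n_params, n_tokens) - loss(n_params, 2 * n_tokens)


if __name__ == "__main__":
    n = 1e9  # hypothetical model size
    for d in (1e9, 1e10, 1e11, 1e12):
        print(
            f"D={d:.0e}  loss={loss(n, d):.3f}  "
            f"gain from 2x data={gain_from_doubling_data(n, d):.4f}"
        )
```

Each doubling of D buys a smaller absolute reduction than the last, and the loss never drops below E, which is the floor the section describes.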

3.2 Does diversity help?

Yes, operating across independent axes from dataset volume. A diverse data mixture yields two simultaneous effects: it drives down the asymptotic error floor via cross-domain transfer and expanded manifold coverage, and it increases the intrinsic dimension of the dataset, \(d_{\mathrm{int}}\). In the resolution-limited regime, \(\beta \approx 4/d_{\mathrm{int}}\) for a smooth target, where \(d_{\mathrm{int}}\) is the intrinsic dimension of the data manifold (Sharma & Kaplan 2020; Bahri 2021).

Because β enters as an inverse of dimension, halving a task's intrinsic dimension roughly doubles the scaling exponent: the loss curve falls faster. But this comes at the cost of converging to an inferior optimum that does not generalize. To maximize generalization, pre-training distributions must deliberately avoid artificially low intrinsic dimensionality.
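A short sketch of what the inverse-dimension relation implies for data requirements. It takes the heuristic β ≈ 4/d from the text at face value; the function names and the example dimensions are assumptions chosen for illustration.

```python
def scaling_exponent(d_int: float) -> float:
    """Resolution-limited heuristic from the text: beta ~ 4 / d_int."""
    return 4.0 / d_int


def data_multiplier_to_halve(d_int: float) -> float:
    """Factor by which D must grow to halve the reducible term B * D^-beta.

    Solving (k * D)^-beta = 0.5 * D^-beta gives k = 2^(1/beta) = 2^(d_int / 4).
    """
    return 2.0 ** (1.0 / scaling_exponent(d_int))


if __name__ == "__main__":
    for d in (8, 16, 32):  # hypothetical intrinsic dimensions
        print(
            f"d_int={d:>2}  beta={scaling_exponent(d):.3f}  "
            f"data needed to halve reducible loss: x{data_multiplier_to_halve(d):.0f}"
        )
```

Under this heuristic the data multiplier grows exponentially in the intrinsic dimension, which is why a low-dimensional task looks cheap to fit while a diverse one does not.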

The data-mixing law (Ye et al. 2024) decomposes a mixture's loss into orthogonal per-domain power laws and cross-coupling terms, which dictate either positive transfer or negative interference.

3.3 Does repetition help?

Repetition provides marginal utility up to approximately four epochs, nearly matching the efficiency of fresh tokens; beyond this threshold, utility decays rapidly, eventually degrading capability. Muennighoff et al. (2023) fit an exponential saturation with half-life R* ≈ 15: four passes incur negligible penalty, while by sixteen passes returns diminish sharply, with additional compute eventually yielding no information gain. Furthermore, over-indexing on a narrow data fraction drives a localized double-descent anomaly in test loss and degrades the circuit mechanisms, specifically induction and copying heads, that govern in-context learning (Hernandez et al., 2022). Repeating just 0.1% of a corpus 100 times collapses the downstream performance of an 800M-parameter model to that of a 400M-parameter baseline, demonstrating that even minor distributional redundancies act as massive capital drains.

Loss of LLMs (4.2B parameters) trained on repeated data degrades, resulting in worse-than-expected performance (Muennighoff et al., 2023)
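The saturation behavior can be sketched with an exponential-saturation form for effective data, D' = U + U · R* · (1 − exp(−R / R*)), where U is the unique data and R is the number of repetitions beyond the first pass. The value R* = 15 below is an assumption, chosen to be consistent with the four-pass and sixteen-pass behavior described above, and the function name is hypothetical.

```python
import math

R_STAR = 15.0  # assumed saturation constant, see lead-in


def effective_unique_equivalents(epochs: int, r_star: float = R_STAR) -> float:
    """Effective data, in multiples of the unique dataset size, after `epochs` passes.

    Implements D'/U = 1 + R* * (1 - exp(-R / R*)), with R = epochs - 1 repeats.
    """
    if epochs < 1:
        raise ValueError("epochs must be >= 1")
    repeats = epochs - 1
    return 1.0 + r_star * (1.0 - math.exp(-repeats / r_star))


if __name__ == "__main__":
    for epochs in (1, 4, 16, 64):
        eff = effective_unique_equivalents(epochs)
        print(f"{epochs:>2} epochs -> worth {eff:5.2f} unique passes ({eff / epochs:.0%} efficiency)")
```

Under this form, four epochs retain most of the value of fresh data, sixteen epochs retain roughly two thirds, and the total can never exceed 1 + R* unique passes no matter how long training continues.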

3.4 What if the data is nearly the same?

Near-duplicates exist on a utility continuum bounded by exact repetition and entirely novel samples. Removing these redundancies improves model generalization while freeing the token budget for distinct manifolds. Lee et al. (2021) found individual sentences appearing over 60,000 times within the C4 corpus. Redundancy in large-scale corpora therefore calls for systematic deduplication, which mitigates verbatim memorization and accelerates convergence. Mechanistically, a small perturbation forces a model to map identical targets across a bounded neighborhood (x and x + ε), serving as an implicit consistency regularization. At very small ε, the near-duplicate adds almost nothing beyond a repeat and has very low utility. At moderate ε, the regularization is useful, and as ε expands, the sample becomes a distinct data point. Densely sampling within a narrow neighborhood rapidly saturates local capacity and hurts model performance.
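As a rough illustration of near-duplicate filtering, the sketch below drops any text whose word-trigram Jaccard similarity to an already kept text exceeds a threshold. This is a simple stand-in and not the method of Lee et al. (2021); the 0.8 threshold, the trigram size, and all function names are assumptions, and the quadratic loop is only suitable for small examples.

```python
def shingles(text: str, n: int = 3) -> frozenset:
    """Set of word n-grams; texts shorter than n words become a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy O(n^2) filter: keep a doc only if it is below threshold against all kept docs."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    corpus = [
        "the robot picks up the red block",
        "the robot picks up the red block",      # exact repeat
        "the robot picks up the red block now",  # near-duplicate
        "a drone surveys the flooded field at dawn",
    ]
    print(drop_near_duplicates(corpus))
```

The threshold plays the role of ε in the text: raising it keeps more perturbed variants, while lowering it treats a wider neighborhood as redundant.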

3.5 What about discovery of the long tail?

Rare, out-of-distribution (OOD) events yield outsized marginal utility because model performance at the scaling limit is constrained by the failure tail. Real-world physical distributions are heavy-tailed; macro-capabilities emerge at scale from mastering a Zipfian distribution of subskills, acquired sequentially in order of frequency (Michaud et al., 2023). Achieving frontier accuracy requires fitting these rare subpopulations, which collectively constitute a large share of total operational density (Feldman, 2020). Consequently, optimizing a corpus by filtering for high-difficulty, low-frequency samples can bypass standard power-law scaling constraints entirely (Sorscher et al., 2022). Because these edge cases are rooted in real-world stochasticity, they are intractable to replicate via synthetic generation or structured staging. However, as the model’s known distribution expands, remaining novel variations become exponentially rarer, driving a steep increase in the marginal cost of discovery.
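The rising cost of discovery can be illustrated with a toy model. It assumes subskills are encountered independently with Zipfian frequencies and uses the mean of a geometric distribution for the expected wait until a given subskill first appears. The number of subskills, the exponent of 1.0, and the function names are assumptions made for this sketch.

```python
def zipf_probabilities(num_skills: int, exponent: float = 1.0) -> list[float]:
    """Normalized Zipfian frequencies: p_k proportional to k^-exponent for rank k."""
    weights = [rank ** -exponent for rank in range(1, num_skills + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_samples_to_first_hit(p: float) -> float:
    """Mean of a geometric distribution: expected draws until an event of probability p occurs."""
    return 1.0 / p


if __name__ == "__main__":
    probs = zipf_probabilities(10_000)  # hypothetical number of subskills
    for rank in (1, 10, 100, 1_000, 10_000):
        samples = expected_samples_to_first_hit(probs[rank - 1])
        print(f"rank {rank:>6}: about {samples:,.0f} samples to observe once")
```

If each sample costs a fixed amount to collect, the cost of seeing the next unseen subskill grows with its rank, which is the marginal-cost escalation the paragraph describes.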

Summary:

More data buys a power law down to a floor.

Diversity lowers the floor at the cost of rate.

Repetition buys little and eventually hurts performance.

Near-duplicate data is the weakest of all, short of a deliberate small perturbation.

Long-tail rare events are very informative, yet increasingly costly to discover.

  4. Economic Perspective: Marginal Utility per Dollar?

In language modeling, compute is the binding constraint, and data is abundant and low-cost. In robotics, by contrast, useful data is strictly constrained by acquisition costs. Consequently, the objective function shifts from maximizing compute efficiency to maximizing marginal loss reduction per dollar.

The global capability target is modeled as a convex combination over discrete task clusters j with assigned prior weights \(\pi_j\). Each independent cluster obeys a distinct scaling envelope conditioned on environmental parameters:

\(L_j=A_j(\phi)+B_j(\phi)\,D_j^{-\beta_j} \)

Here \(A_j(\phi)\) is a floor and \(B_j(\phi)\,D_j^{-\beta_j}\) is the data-reducible term, with exponent \(\beta_j \approx 4/d_j\) set by the cluster’s intrinsic dimension \(d_j\).

To optimize a finite capital allocation, resource expenditure must equalize the marginal value per dollar across all available collection and curation channels.
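The equal-marginal-value condition can be sketched numerically for the per-cluster envelope above. The example assumes each cluster has a constant dollar cost per unit of data, which the text does not state, and finds the shadow price at which every cluster's marginal loss reduction per dollar is equal and the budget is fully spent. The `Cluster` fields, the cost values, and the bisection bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    weight: float     # pi_j, prior weight of the task cluster
    scale: float      # B_j, coefficient of the data-reducible term
    beta: float       # beta_j, scaling exponent
    unit_cost: float  # assumed constant dollars per unit of data


def marginal_value_per_dollar(c: Cluster, data: float) -> float:
    """-d(pi_j * L_j)/dD_j divided by cost; the floor A_j drops out of the derivative."""
    return c.weight * c.scale * c.beta * data ** -(c.beta + 1.0) / c.unit_cost


def data_at_price(c: Cluster, lam: float) -> float:
    """Data level at which the marginal value per dollar equals lam."""
    return (c.weight * c.scale * c.beta / (lam * c.unit_cost)) ** (1.0 / (c.beta + 1.0))


def allocate(clusters: list[Cluster], budget: float, iters: int = 200) -> list[float]:
    """Bisect in log space on the shadow price lam until total spend matches the budget."""
    lo, hi = 1e-30, 1e30
    for _ in range(iters):
        lam = (lo * hi) ** 0.5
        spend = sum(c.unit_cost * data_at_price(c, lam) for c in clusters)
        if spend > budget:
            lo = lam  # overspending: a higher price shrinks every allocation
        else:
            hi = lam
    return [data_at_price(c, hi) for c in clusters]


if __name__ == "__main__":
    clusters = [Cluster(0.5, 10.0, 0.5, 1.0), Cluster(0.5, 10.0, 0.25, 5.0)]
    for c, d in zip(clusters, allocate(clusters, budget=1000.0)):
        print(f"data={d:8.2f}  marginal value per dollar={marginal_value_per_dollar(c, d):.6f}")
```

At the solution both clusters report the same marginal value per dollar; any channel with a higher value would deserve more of the budget, and any with a lower value less.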

Interventional Channel: Activ
