Form Before Data: The Real Bottleneck for Physical AI
The bottleneck for physical AI is not intelligence but the right physical form and senses to collect real-world data. Tesla succeeded with cars because the car was already the right shape. Humanoid robots lack tactile sensing and task-specific data. Current successful physical AI applications are not humanoid but simple arms with advanced vision, e.g., in agriculture.
Jun 21, 2026
A reader messaged me last week with a question about a topic that has been in my backlog for a few months now: AI and the physical world. The request was the following: “Can you elaborate on the rate of adoption of AI for the physical world? We see [it] operating almost entirely in the digital realm. The Tesla FSD vehicles are examples of AI moving in the physical world. We are also beginning to see other machines such as humanoid robots move through space by interpreting the visual field. But these examples are still very uncommon.”
He’s right, and while these examples are “still very uncommon”, the field is progressing fast. We have AI that writes code, drafts contracts, and passes the bar, and we have a handful of cars and factory robots, and almost nothing in between. But my feeling is that the gap isn’t intelligence. I think the models and the foundational technology are there. What we are missing is the right “body” and “senses” for the model to make sense of the world, and the data needed to teach it how to navigate it.
That’s the thesis I want to make the case for in this post. Tesla cracked self-driving first not because its models were the smartest (until quite recently they relied on traditional visual pattern-recognition models rather than end-to-end deep learning), but because the car was already the right shape to act in the world for their specific task. It rolls, it steers, it has somewhere to put cameras. The form of the robot and the actions it had to perform in the physical environment were already well-defined. Everything physical AI does next is a search for that same fit: the right form for each task, and the intelligence to drive it through a messy, imprecise, badly-lit world that no simulation fully captures.
Why the car came first
A car is a strange thing to call an autonomous robot, but that’s essentially what a self-driving car is: a machine that senses its environment and acts in it. And it turns out to be an unusually “easy” (big quotes) robot. It moves in two dimensions. It has four contact points with the world and they never change. It can’t fall over, it can’t drop anything, and the rules of the road are written down. Compare that to a hand picking up an egg, where success depends on grip force you have to feel rather than see, and you start to understand why driving fell first. The car was already the right shape for the job, and the operations it could perform and its core goals were well-defined. Just as LLMs cracked coding first because there was an objective feedback loop to optimise, the car was the obvious first target (in retrospect) for AI in the physical world.
Everyone (I hope) who owns a car knows how to drive it. Tesla managed to ship an attractive EV that people would buy and drive, and in doing so collectively generate the real-world data required to eventually teach an artificial brain how to drive one of these robots with wheels autonomously. Having access to all of this raw data about the physical world, in virtually every kind of scenario, environment and location, is what has enabled Tesla to finally crack FSD.
Tesla’s fleet has now passed 10 billion miles driven on FSD, adding roughly a million miles a day. Since FSD v12 the system has been a single end-to-end neural network, vision only, with the old hand-written rules torn out. It learned to drive by watching the fleet drive.
Notice the order of operations. Tesla didn’t sell cars and then bolt on autonomy as a side project. The car was the data-collection programme. Every vehicle on the road has eight cameras recording how real people handle real roads in bad weather, and that stream is what trains the next model. They built the perfect sandbox environment and data flywheel to train their self-driving models. Waymo, with a richer sensor suite (Tesla uses cameras only) and a smaller fleet, has so far been unable to out-engineer that simple advantage.
One of the reasons Tesla could build this flywheel is that the “body”, “environment”, and “rules” for these robots were pretty well-defined. You cannot collect ten billion miles of driving data without ten million things shaped like cars already driving around. Form is the precondition for data, not the other way round. That is the move every physical-AI company is now trying to repeat, and it is much harder when the task is folding laundry instead of staying in a lane.
If we treat Tesla cars as the first instance of autonomous physical robots, I think there are a lot of lessons that we can extract and immediately apply to the field of robotics.
The two things a body still can’t do
If form is the precondition, the obvious question is why we don’t already have the right forms everywhere. The bodies seem to exist. Figure, Optimus and Unitree all walk, balance and grasp, and bipedal locomotion that took the field decades is close to a solved engineering problem. So what’s missing?
Two things, and neither is intelligence in the abstract sense. The model can already plan. What it can’t reliably do is feel, and what we can’t yet cheaply do is teach it the specific task. While we can think of a car as a “narrow body” for a “narrow task”, I see humanoid robots as a general-purpose form factor that we could teach a wide range of the tasks we already do ourselves. We just need to teach them how we do it.
Driving is a vision problem, and vision is the sense AI is best at and one of the first it was able to crack (do you remember the amazing things that convolutional neural networks, a.k.a. CNNs, were able to do a decade ago?). Folding a shirt is not. It needs touch, the kind that adjusts grip force when a fabric starts to slip, and dexterous tactile hands are the part of the body that still lags the rest. The economics show where the difficulty sits: actuators alone run 30 to 40% of a high-end humanoid’s bill of materials, a single high-torque actuator costs thousands, and there are twenty to forty per robot. A car needs a steering rack. A hand that works by feel needs a torque-controlled actuator at every knuckle, and the supply chain for those parts is not built for volume yet. The right form for manipulation is genuinely harder to build than the right form for driving. The number of actuators that the model has to coordinate for a single action is also far larger than in a car.
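To make the actuator economics concrete, here is a back-of-envelope sketch. The only inputs taken from the text are the 20 to 40 actuators per robot and the 30 to 40% share of the bill of materials. The unit cost of $2,000 is my own round placeholder for “thousands”, not a quoted price, and the function names are made up for illustration.

```python
# Back-of-envelope check of the actuator economics described above.
# ASSUMPTION: unit_cost_usd is a hypothetical round number standing in for
# "thousands" per actuator; it is not a quoted price.


def actuator_cost(count: int, unit_cost_usd: float) -> float:
    """Total actuator spend for one robot."""
    return count * unit_cost_usd


def implied_bom(actuator_total_usd: float, actuator_share: float) -> float:
    """Whole-robot bill of materials implied by the actuators' share of it."""
    if not 0 < actuator_share <= 1:
        raise ValueError("actuator_share must be in (0, 1]")
    return actuator_total_usd / actuator_share


if __name__ == "__main__":
    unit_cost_usd = 2_000.0  # hypothetical placeholder
    for count in (20, 40):
        total = actuator_cost(count, unit_cost_usd)
        for share in (0.30, 0.40):
            bom = implied_bom(total, share)
            print(f"{count} actuators at {share:.0%} of BOM: "
                  f"actuators ${total:,.0f}, implied BOM ${bom:,.0f}")
```

Under that assumed unit cost, the actuators alone land somewhere between $40,000 and $80,000 per robot. The exact figure matters less than the shape of the sum: the cost scales with the number of joints, and a hand that works by feel has a lot of joints.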
Then there’s teaching them how to perform a task accurately. Large models learned language by reading the internet, text humanity had already written and left lying around for free. There is no equivalent corpus of physical action covering almost every possible environment and scenario (the kind of corpus that Tesla collected through more than a decade of people driving its cars). No website stores the exact sequence of joint torques and micro-corrections involved in threading a cable behind a desk or lifting an egg without crushing it. That data has never been recorded and translated into the “senses” built into these humanoid robots.
The workaround that many have tried is to simulate the physical world: as in RL environments, you let the model run a billion virtual attempts overnight to train. Tesla also did this for some years. It works right up to the sim-to-real gap, the point where the policy meets a real machine and the friction is slightly off, the actuator lags a few milliseconds, and the object deforms in a way the simulator never modelled. For humanoids that gap is wider than for four-legged robots, because more joints and more contact mean more ways for a small error to compound into a fall, and there is no general fix. Every team patches it by hand.
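A toy version of the sim-to-real gap, as a sketch rather than anything a real robotics team would ship. A block is pushed across a table and has to stop at a target distance. The “policy” is simply the launch speed that works for the simulator’s friction coefficient. The friction values, and the assumption that the real surface is 10% grippier, are invented for illustration.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def launch_speed(target_m: float, mu: float) -> float:
    """Speed to give a sliding block so friction stops it after target_m."""
    return math.sqrt(2.0 * mu * G * target_m)


def slide_distance(speed: float, mu: float) -> float:
    """Distance a block slides before kinetic friction mu stops it."""
    return speed ** 2 / (2.0 * mu * G)


if __name__ == "__main__":
    target_m = 1.0
    mu_sim, mu_real = 0.50, 0.55  # hypothetical: real surface is 10% grippier

    # The "policy" is tuned against the simulator's friction only.
    speed = launch_speed(target_m, mu_sim)

    print(f"in simulation: {slide_distance(speed, mu_sim):.3f} m")
    print(f"on real table: {slide_distance(speed, mu_real):.3f} m")
```

In simulation the block stops exactly on target. On the slightly grippier real surface it stops about 9% short, because the distance scales with the ratio of the two friction coefficients. That is one parameter, one object and one open-loop action. A humanoid has dozens of joints and many contact points, each with its own small mismatch.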
Put the two together and you get the real state of physical AI in 2026. The model is smart enough and the technology is there. The body can move. What’s missing is a body that can feel its way through a task it has never seen, and the mountain of real-world demonstrations needed to train it. That is why all those slick backflip videos are a demonstration that the form factor and a few specific actions have been cracked, not a deployment. While we are getting close, we are still teaching these general-purpose robots how to perform specific tasks in different scenarios of the physical world.
Humanoids in 2026
The humanoid is the form that gets all the attention, because humanoids are cool and they can move like us. For all I’ve said about how hard manipulation is, the deployments are real, more real than I expected when I started looking.
Figure has robots on the line at BMW’s Spartanburg plant, running ten hours a day at better than 99% placement accuracy across more than a thousand operational hours. Tesla is targeting 50,000 Optimus units in 2026 inside its own factories, and Unitree will sell you a G1 today for around $16,000, the Model 3 of robots: cheap hardware at volume, worry about generality later (I see the “Tesla pattern” here of deploying the platform that enables data collection at scale).
But notice what all three have in common. The tasks are narrow: load this panel, place this part, etc. These are factory jobs in structured spaces, the closest a robot gets to a car’s nice clean lane. The robots are supervised, single-purpose, and a long way from the general-purpose machine that tidies your house. Map them onto the Tesla timeline and we’re roughly where the fleet was a decade ago: the hardware is out gathering experience, the autonomy is years of data away, with the caveat of the form factor.
And here’s the thing the humanoid hype obscures: for most of these jobs, the metal human is the wrong form. These factories could be using robots that are closer to what factory robots look like today. What the humanoid form factor buys instead is a general-purpose platform that can, in principle, be taught any task a human could do.
The robots that are working do not look human
Robots have been around for decades. China runs an operational stock of around two million industrial robots, and Xiaomi builds ten million phones a year in a lights-out factory at 81% automation, but all these robots follow a fixed script: the logic is hardcoded. The task and the operation need to be explicitly specified and implemented in advance, as was the case with the early Teslas.
The thing actually changing in 2026 is that we are starting to see robots whose behaviour comes from a learned model instead of a fixed program, machines that can handle a situation nobody scripted in advance. That is what “AI for the physical world” means to me, and the most successful instances of this so far do not wear a humanoid shape. It shows up first in the jobs where the body is simple but the world is messy, which is exactly where old automation couldn’t go.
Agriculture is the clearest example. A fruit-picking arm may seem like something that can be implemented with classical automation until you look at what it has to do: find a ripe strawberry behind a leaf, under changing light, half-occluded by another berry, and decide in real time whether to pick it. That is a perception problem, and it’s being solved with the same deep-learning vision stack as everything else. Recent harvesters run models like YOLO-based ripeness detectors trained for occlusion and light changes in real orchards, and dedicated ripeness networks that judge a blueberry the way a picker would. The arm is dumb. The eyes are not. John Deere’s autonomous tractors carry a sixteen-camera vision rig and a perception model that reads the field as it goes, rather than following a pre-mapped line. The number of fruit farms running autonomous harvesters jumped from 950 in 2021 to over 4,300 in 2024, and that curve is bending now because the machines can finally interpret the real world, not because anyone invented a new arm.
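To show how little the “dumb arm” has to decide once the eyes have done their work, here is a minimal sketch of the pick-or-skip step. It is not any vendor’s pipeline. The Detection fields, the thresholds and the should_pick function are all hypothetical, standing in for whatever a real ripeness detector would output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One berry as reported by a (hypothetical) vision model."""
    confidence: float  # detector confidence that this is a berry, 0..1
    ripeness: float    # estimated ripeness, 0 (green) .. 1 (fully ripe)
    occluded: float    # fraction hidden by leaves or other fruit, 0..1


def should_pick(d: Detection,
                min_confidence: float = 0.6,
                min_ripeness: float = 0.8,
                max_occluded: float = 0.5) -> bool:
    """Pick only berries that are confidently detected, ripe and reachable.

    The thresholds are illustrative defaults, not tuned values.
    """
    return (d.confidence >= min_confidence
            and d.ripeness >= min_ripeness
            and d.occluded <= max_occluded)


if __name__ == "__main__":
    frame = [
        Detection(confidence=0.92, ripeness=0.90, occluded=0.10),  # ripe, visible
        Detection(confidence=0.88, ripeness=0.40, occluded=0.00),  # unripe
        Detection(confidence=0.75, ripeness=0.95, occluded=0.70),  # ripe but hidden
    ]
    for d in frame:
        print(d, "->", "pick" if should_pick(d) else "skip")
```

The decision itself is three comparisons. All the difficulty lives upstream, in producing trustworthy numbers for ripeness and occlusion from a badly-lit, leaf-covered camera frame.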
The deeper shift is the arrival of foundation models for action, the physical-world equivalent of GPT (somethin
[truncated for AI cost control]