2026-05-25 06:56 UTC | Updated: 2026-06-30 13:03 UTC

"VLA and World Models Are Not the Endgame; There Will Be a Model Unique to the Physical World" | Ant Lingbo's Shen Yujun @ AIGC2026

At the 2026 China AIGC Industry Summit, Shen Yujun, Chief Scientist of Ant Lingbo Technology, argued that large models have benefited from decades of internet data, but robotics still faces a data vacuum in the physical world. He believes that neither VLA nor world models alone will be the final solution for embodied intelligence; instead, they will converge into a model unique to the physical world. Ant Lingbo positions itself as the 'general brain' for robots, akin to an operating system, with a focus on spatial perception. Shen predicts that around 2028, when everyone can contribute data to robots, embodied intelligence will have its 'ChatGPT moment'.

Source: QbitAI (量子位) | Author: 一水

At the 2026 China AIGC Industry Summit, hosted by QbitAI, Shen Yujun, Chief Scientist of Ant Lingbo Technology, engaged in a deep discussion with Li Gen, co-founder and editor-in-chief of QbitAI, on the transition from AIGC to AIGA (AI Generated Action). Shen, a seasoned expert in robotics, argued that while large language models have capitalized on decades of internet data, the field of robotics is still in a data desert when it comes to physical-world interactions.

Shen introduced the concept of AIGA, emphasizing that the next phase of AI 2.0 should move from digital entertainment to physical productivity. He stressed that generating content is not enough; AI must generate actions to provide tangible services. Ant Lingbo, a subsidiary of Ant Group, has positioned itself as a provider of a 'general brain' for robots, similar to an operating system for smartphones. The company focuses on the intelligence layer, leaving hardware to specialized manufacturers.

A key technical challenge, according to Shen, is spatial perception. Robots need to understand the physical world through sensors like depth cameras and force sensors. While the industry is currently focused on VLA (Vision-Language-Action) and world models, Shen believes that neither approach alone will be the ultimate solution. VLA excels at human-robot interaction and immediate tasks, while world models are better at predicting future states. Ultimately, the two will converge into a model unique to the physical world, designed from the ground up for robotics.

Data standardization is another critical factor. Without standardized data, scaling robot intelligence is impossible. Shen predicts that within one to two years, there will be benchmark demonstrations of models actually deployed in commercial settings. Within two to three years, these examples will be replicated across industries. After that, robots will begin to enter consumer markets in limited roles, gradually spreading to households, much like electric vehicles today.

The 'ChatGPT moment' for embodied intelligence, Shen argues, will come when ordinary people can contribute data to train robots. This could happen around 2028, after a period of mutual adjustment between model developers and data companies. Ant Lingbo aspires to be the Android of the robot era, providing the universal operating system that enables diverse hardware to operate intelligently.

Throughout the conversation, Shen emphasized that the path to general robot intelligence requires solving data, perception, and model integration. He dismissed the idea that any single current approach – VLA or world model – represents the endgame. Instead, he envisions a future where robots learn from real-world interactions and feedback, leading to a new form of intelligence that is deeply rooted in the physical world.