Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
This paper argues that despite rapid performance gains of Vision-Language-Action (VLA) models on robot manipulation benchmarks, current evaluation metrics cannot distinguish semantic from physical generalization, thus failing to verify physical reasoning abilities. The authors propose evaluation designs with controlled variation to separately measure these two capabilities.
-->
[Submitted on 28 Jun 2026]
Title:Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
View a PDF of the paper titled Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning, by Taozhao Chen and Ian Manchester and Huaming Chen
View PDF HTML (experimental)
Abstract:Vision-Language-Action (VLA) systems, built on pretrained vision-language models (VLMs), have shown rapidly improving performance on robot manipulation benchmarks. These gains are commonly interpreted as evidence that semantic representations learned from internet-scale data transfer to physical execution generalization. This position paper argues that the assumption underlying this interpretation -- that semantic generalization is sufficient to support physical action decisions -- has not been independently verified and cannot be tested under current evaluation protocols. We support this claim by decomposing VLA policies into semantic mapping and physical action decision, and showing that task success rate -- the dominant evaluation metric -- cannot distinguish between these two sources of capability. As a result, improvements in benchmark performance are consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization. We further argue that this identifiability gap has been reinforced through narrative drift, whereby successive systems inherit and strengthen prior interpretations of performance gains without isolating the underlying causal mechanism. To address this limitation, we propose a research direction based on evaluation designs that introduce controlled variation to separately measure semantic and physical generalization. Such designs make it possible to causally attribute performance without requiring access to model internals, and to empirically assess the role of VLM backbones as semantic interfaces rather than implicit sources of physical competence. Our goal is not to refute the role of VLMs in robotics, but to clarify the conditions under which claims of physical generalization can be meaningfully evaluated.
Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI)
Cite as: arXiv:2606.30686 [cs.RO]
(or arXiv:2606.30686v1 [cs.RO] for this version)
https://doi.org/10.48550/arXiv.2606.30686
arXiv-issued DOI via DataCite
Submission history
From: Taozhao Chen [view email] [v1] Sun, 28 Jun 2026 14:03:57 UTC (61 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning, by Taozhao Chen and Ian Manchester and Huaming Chen
View PDF
HTML (experimental)
TeX Source
view license
Current browse context:
cs.RO
new | recent | 2026-06
Change to browse by:
cs cs.AI
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Loading...
Data provided by:
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
TXYZ.AI (What is TXYZ.AI?)
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)