ContactWorld: What Matters in Vision-Tactile World Models for Contact-Rich Manipulation
The ContactWorld benchmark, spanning 12 contact-rich manipulation tasks, shows that spatially structured and temporally continuous representations such as point clouds raise average planning success rates from 20.7% (wrist-view) and 22.0% (front-view) to 32.1%. The effectiveness of tactile sensing depends on cross-modal compatibility: combining point clouds with tactile force fields achieves 36.1%. Tactile information becomes increasingly important for long-horizon planning.
[Submitted on 11 Jun 2026]
Authors: Zhiyuan Zhang and 8 other authors
Abstract: Contact-rich manipulation requires world models to reason over complex contact dynamics from multimodal sensory observations. However, it remains unclear which representation properties fundamentally support stable long-horizon planning in contact-rich settings. In this paper, we present ContactWorld, a benchmark and systematic empirical study of vision-tactile world models spanning 12 contact-rich manipulation tasks, including insertion, disassembly, screwing, and exploratory interaction. Across extensive experiments, we find that representations that are both spatially structured and temporally continuous consistently achieve the strongest planning performance. In particular, point-cloud observations improve average planning success rates from 20.7% with wrist-view observations and 22.0% with front-view observations to 32.1%. We further find that the effectiveness of tactile sensing depends critically on cross-modal representation compatibility rather than modality scaling alone. Combining point-cloud observations with tactile force-field representations, which preserve richer spatial structure and interaction dynamics, further improves performance to 36.1%, yielding the strongest overall planning performance across all evaluated tasks. Moreover, tactile sensing becomes increasingly important under long-horizon planning objectives, where compounding prediction errors and contact uncertainty accumulate over time. Together, these findings highlight the importance of representation structure, multimodal compatibility, and long-horizon robustness in vision-tactile world models for contact-rich robotic manipulation.
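To put the reported success rates in perspective, the short Python sketch below computes the absolute gain (in percentage points) and the relative gain between the configurations named in the abstract. The four rates are taken from the abstract; the helper name `gains` and the dictionary keys are illustrative assumptions and do not come from the paper or any released code.

```python
from typing import Tuple

# Average planning success rates (%) as reported in the abstract.
# The dictionary keys are illustrative labels, not identifiers from the paper.
reported = {
    "wrist_view": 20.7,
    "front_view": 22.0,
    "point_cloud": 32.1,
    "point_cloud_plus_force_field": 36.1,
}


def gains(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# Point clouds versus each single-camera view.
wrist_to_pc = gains(reported["wrist_view"], reported["point_cloud"])
front_to_pc = gains(reported["front_view"], reported["point_cloud"])

# Adding tactile force fields on top of point clouds.
pc_to_tactile = gains(
    reported["point_cloud"], reported["point_cloud_plus_force_field"]
)

print("wrist-view -> point cloud:", wrist_to_pc)
print("front-view -> point cloud:", front_to_pc)
print("point cloud -> point cloud + force field:", pc_to_tactile)
```

Under these numbers, switching to point clouds adds roughly 10 to 11 percentage points, while adding tactile force fields contributes a further 4 points. These are averages across tasks as stated in the abstract; per-task differences are not derivable from the abstract alone.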
Comments: 32 pages, 12 figures, supplementary material included
Subjects: Robotics (cs.RO)
Cite as: arXiv:2606.13877 [cs.RO]
(or arXiv:2606.13877v1 [cs.RO] for this version)
https://doi.org/10.48550/arXiv.2606.13877
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Zhiyuan Zhang
[v1] Thu, 11 Jun 2026 20:01:49 UTC (6,160 KB)