2026-06-15 | Original | 2 min read | Updated: 2026-06-15

ContactWorld: What Matters in Vision-Tactile World Models for Contact-Rich Manipulation

The ContactWorld benchmark, spanning 12 contact-rich manipulation tasks, finds that spatially structured and temporally continuous representations such as point clouds raise average planning success rates from 20.7% (wrist-view) and 22.0% (front-view) to 32.1%. The effectiveness of tactile sensing depends on cross-modal representation compatibility; combining point clouds with tactile force fields reaches 36.1%. Tactile information becomes increasingly important for long-horizon planning.

Source: arXiv Robotics
Authors: Zhiyuan Zhang, Pokuang Zhou, Kaidi Zhang, Adeesh Desai, Temitope Amosa, Davood Soleymanzadeh, Jiuzhou Lei, Minghui Zheng, Yu She

[2606.13877] ContactWorld: What Matters in Vision-Tactile World Models for Contact-Rich Manipulation

[Submitted on 11 Jun 2026]

Abstract: Contact-rich manipulation requires world models to reason over complex contact dynamics from multimodal sensory observations. However, it remains unclear which representation properties fundamentally support stable long-horizon planning in contact-rich settings. In this paper, we present ContactWorld, a benchmark and systematic empirical study of vision-tactile world models spanning 12 contact-rich manipulation tasks, including insertion, disassembly, screwing, and exploratory interaction. Across extensive experiments, we find that representations that are both spatially structured and temporally continuous consistently achieve the strongest planning performance. In particular, point-cloud observations improve average planning success rates from 20.7% with wrist-view observations and 22.0% with front-view observations to 32.1%. We further find that the effectiveness of tactile sensing depends critically on cross-modal representation compatibility rather than modality scaling alone. Combining point-cloud observations with tactile force-field representations, which preserve richer spatial structure and interaction dynamics, further improves performance to 36.1%, yielding the strongest overall planning performance across all evaluated tasks. Moreover, tactile sensing becomes increasingly important under long-horizon planning objectives, where compounding prediction errors and contact uncertainty accumulate over time. Together, these findings highlight the importance of representation structure, multimodal compatibility, and long-horizon robustness in vision-tactile world models for contact-rich robotic manipulation.
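To put the reported averages in perspective, the short Python sketch below computes the absolute gain (in percentage points) and the relative gain between the configurations named in the abstract. The four success rates are taken from the abstract; the derived gains, the `gain` helper, and the dictionary keys are illustrative additions and do not appear in the paper.

```python
# Average planning success rates (%) as reported in the abstract.
reported = {
    "wrist-view": 20.7,
    "front-view": 22.0,
    "point cloud": 32.1,
    "point cloud + tactile force field": 36.1,
}


def gain(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in %).

    Hypothetical helper for illustration; both values are rounded to
    one decimal place to match the precision of the reported rates.
    """
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# Pairs of (baseline, improved) configurations to compare.
comparisons = [
    ("wrist-view", "point cloud"),
    ("front-view", "point cloud"),
    ("point cloud", "point cloud + tactile force field"),
]

for base, new in comparisons:
    abs_pp, rel_pct = gain(reported[base], reported[new])
    print(f"{base} -> {new}: +{abs_pp} points ({rel_pct}% relative)")
```

Under these figures, moving to point clouds is the larger step (roughly 10 to 11 points), while adding tactile force fields contributes a further 4 points on top. The abstract gives only averages across tasks, so this arithmetic says nothing about per-task variation.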

Comments: 32 pages, 12 figures, supplementary material included

Subjects: Robotics (cs.RO)

Cite as: arXiv:2606.13877 [cs.RO]

(or arXiv:2606.13877v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2606.13877

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhiyuan Zhang
[v1] Thu, 11 Jun 2026 20:01:49 UTC (6,160 KB)
