Published: 2026-06-18 | Updated: 2026-06-18

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

PAIWorld introduces a diffusion-transformer framework with geometry-aware cross-view attention, geometric rotary position embedding, and latent 3D-REPA distillation to achieve multi-view 3D consistency for robotic manipulation, ranking 1st on WorldArena and 2nd on AgiBot-Challenge2026.

Source: arXiv Robotics
Authors: Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

[Submitted on 16 Jun 2026]

Abstract: World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
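The abstract says the Geometric Rotary Position Embedding encodes camera ray directions and extrinsic poses, but it does not give the formulation. As a rough illustration of the geometric input such an embedding could consume, the sketch below computes a per-pixel ray direction in the world frame using the standard pinhole camera model. It is not the paper's method: the function name `pixel_ray_world`, the parameter names, and the world-to-camera rotation convention are all assumptions made for this example, and lens distortion is ignored.

```python
import math

def pixel_ray_world(u, v, fx, fy, cx, cy, R_wc):
    """Unit ray direction in the world frame for pixel (u, v).

    Assumptions (hypothetical, not from the paper):
      - pinhole intrinsics fx, fy, cx, cy with no lens distortion
      - R_wc is a 3x3 world-to-camera rotation given as nested lists,
        so the camera-to-world rotation is its transpose
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame: d_world = R_wc^T @ d_cam.
    d_world = [sum(R_wc[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    # Normalize so only the direction remains.
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]

# Two hypothetical cameras sharing the same intrinsics.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# World-to-camera rotation of 90 degrees about the y-axis.
rot_y_90 = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]

ray_a = pixel_ray_world(320, 240, 500.0, 500.0, 320.0, 240.0, identity)
ray_b = pixel_ray_world(320, 240, 500.0, 500.0, 320.0, 240.0, rot_y_90)
print(ray_a, ray_b)
```

The same pixel coordinate maps to different world-frame rays in the two cameras, which is the kind of view-dependent geometric signal that plain token concatenation does not expose to the attention mechanism.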

Subjects: Robotics (cs.RO)

Cite as: arXiv:2606.18375 [cs.RO]

(or arXiv:2606.18375v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2606.18375

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yuhang Huang
[v1] Tue, 16 Jun 2026 18:23:23 UTC (16,051 KB)
