What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

A new benchmark called What-If World tests the causal reasoning of video generation models by giving them paired prompts that differ in one physical detail and checking whether the resulting videos diverge the way physics predicts. Across nine state-of-the-art models, none exceeds 52% on the paired score, and open-source models cluster near 28%, indicating substantial room for improvement. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its physics.

Article intelligence

Key points

  • The What-If World benchmark uses 319 prompt pairs, each varying a single physical variable, to test causal understanding in video generation models. It is built on real frames from nuScenes and DROID.
  • Scoring uses the APEO rubric (Adherence, Physics, Environment, Outcome). All nine models struggle: no system exceeds 52% on the paired score, and open-source models cluster near 28%.
  • Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.
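To make the "single variable change" idea concrete, here is a minimal Python sketch of how one prompt pair could be represented and checked. The `PromptPair` class, its field names, the file path, and the example prompts are all hypothetical illustrations, not the benchmark's actual data format or content.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """Hypothetical container for one paired prompt (not the paper's schema)."""
    scene_frame: str  # assumed: path to the shared starting frame
    variable: str     # assumed: label of the physical variable that is varied
    prompt_a: str
    prompt_b: str


def changed_words(pair: PromptPair) -> list[tuple[str, str]]:
    """Return the word-level spans that differ between the two prompts."""
    words_a = pair.prompt_a.split()
    words_b = pair.prompt_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    changes = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return changes


# Invented example prompts: the wording differs by one word,
# but the physically correct outcomes should differ substantially.
pair = PromptPair(
    scene_frame="frames/example_0001.png",
    variable="mass",
    prompt_a="The robot arm pushes the empty cup across the table.",
    prompt_b="The robot arm pushes the full cup across the table.",
)
print(changed_words(pair))
```

A check like this only confirms that the textual edit is small. Whether the two generated videos diverge correctly is what the rubric has to judge.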

Why it matters

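This matters because video generation models are increasingly used as world simulators for driving and robotic manipulation, where the key question is whether the output changes correctly when the input changes. Existing benchmarks score videos one at a time, so they cannot detect a model that produces two individually plausible videos that fail to differ the way physics predicts.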

Technical impact

May affect model selection and evaluation practice for teams that use video generation models for action-conditioned simulation or model-based planning.

[2605.27589] What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

[Submitted on 26 May 2026]

Authors: Kunlin Cai and 9 other authors

Abstract: Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.
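The abstract names the four APEO checks but does not say how they combine into the paired score. The Python sketch below assumes, purely for illustration, that each check is a binary verdict per pair and that a pair counts only when all four pass. `APEOVerdict`, `paired_score`, and the sample verdicts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class APEOVerdict:
    """Hypothetical per-pair verdicts for the four APEO checks."""
    adherence: bool    # each video follows its own prompt
    physics: bool      # each video is physically consistent
    environment: bool  # the shared scene is preserved across both videos
    outcome: bool      # the videos end in the correct difference

    def passed(self) -> bool:
        # Assumption: a pair passes only if every check passes.
        return all((self.adherence, self.physics, self.environment, self.outcome))


def paired_score(verdicts: list[APEOVerdict]) -> float:
    """Fraction of pairs passing all four checks (assumed aggregation)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v.passed() for v in verdicts) / len(verdicts)


# Invented verdicts. The second pair models the failure the abstract describes:
# both videos look plausible on their own, but they do not diverge correctly.
verdicts = [
    APEOVerdict(adherence=True, physics=True, environment=True, outcome=True),
    APEOVerdict(adherence=True, physics=True, environment=True, outcome=False),
    APEOVerdict(adherence=True, physics=False, environment=True, outcome=False),
    APEOVerdict(adherence=False, physics=True, environment=True, outcome=True),
]
score = paired_score(verdicts)
print(f"paired score: {score:.0%}")
```

Under this assumed all-or-nothing rule, a per-video benchmark could rate both videos of the second pair highly, while the paired score still counts that pair as a failure.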

Comments: 38 pages, World Model Benchmark

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2605.27589 [cs.CV]

(or arXiv:2605.27589v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2605.27589

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Kunlin Cai. [v1] Tue, 26 May 2026 19:02:26 UTC (1,497 KB)
