Published: 2026-07-02 04:00 UTC | Updated: 2026-07-02 08:26 UTC

EmbodimentSemantic: A Spatial Scene-Graph Dataset and Benchmark for Vision-Language Models on Embodied Manipulation Trajectories

Researchers introduce EmbodimentSemantic, a dataset and benchmark for spatial grounding in vision-language-action systems. It represents scenes as directed object-relation-object triplets and includes real-world and simulator-grounded data. Experiments show current models struggle with depth-aware and viewpoint-dependent spatial structures.

Source: arXiv Robotics
Authors: Hassan Jaber, Refinath S N, Luca Cagliero, Christopher E. Mower, Haitham Bou-Ammar

Article intelligence

Engineers | Advanced

Key points

EmbodimentSemantic uses directed object-relation-object triplets to explicitly represent spatial arrangements, enabling direct evaluation of object binding and relation prediction.
The dataset includes real-world observations from the low-cost SO101 robot arm and over 60K frames from the LIBERO simulator with paired views.
Evaluation across multiple VLMs reveals plausible relation predictions but poor performance on exact depth and viewpoint-dependent spatial understanding.
Injecting scene graphs into VLA policy prompts shows potential for improving downstream manipulation.
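The last key point, injecting scene graphs into VLA policy prompts, can be illustrated with a minimal sketch. The source does not give the prompt template or the relation vocabulary, so the template text, the relation names (`left_of`, `behind`), and the helper names `scene_graph_to_text` and `build_prompt` are all hypothetical.

```python
from typing import List, Tuple

# (subject, relation, object). The triplet is directed, so order matters.
Triplet = Tuple[str, str, str]


def scene_graph_to_text(triplets: List[Triplet]) -> str:
    """Serialize directed triplets, one per line, in a stable sorted order."""
    lines = [f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in sorted(triplets)]
    return "\n".join(lines)


def build_prompt(instruction: str, triplets: List[Triplet]) -> str:
    """Prepend a textual scene graph to the instruction (hypothetical template)."""
    if not triplets:
        # With no relations available, fall back to the plain instruction.
        return instruction
    graph = scene_graph_to_text(triplets)
    return f"Scene relations:\n{graph}\nInstruction: {instruction}"


prompt = build_prompt(
    "put the bowl on the plate",
    [("mug", "left_of", "plate"), ("bowl", "behind", "mug")],
)
print(prompt)
```

Sorting the triplets keeps the prompt deterministic for a given scene, which makes it easier to compare policy behavior with and without the injected graph.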

Why it matters

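Spatial grounding remains a key limitation of VLA systems: current models recognize objects and follow instructions but often lack an explicit representation of how objects are arranged in space. By encoding scenes as directed object-relation-object triplets, EmbodimentSemantic makes object binding and relation prediction directly measurable.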
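As a rough illustration of how triplets allow direct evaluation, the sketch below scores a predicted scene graph against a reference graph in two ways: exact triplet match, and ordered object-pair match that ignores the relation label. The source does not state which metrics the benchmark uses, so precision, recall, and F1 here are an assumed choice, and the relation names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object), directed


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def _prf(pred: Set, gold: Set) -> Scores:
    """Set-based precision, recall, and F1; empty sets score 0.0."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return Scores(p, r, f1)


def score_triplets(pred: Iterable[Triplet], gold: Iterable[Triplet]) -> Scores:
    """Relation prediction: a triplet counts only if all three fields match."""
    return _prf(set(pred), set(gold))


def score_binding(pred: Iterable[Triplet], gold: Iterable[Triplet]) -> Scores:
    """Object binding: compare ordered (subject, object) pairs, ignoring the relation."""
    return _prf({(s, o) for s, _, o in pred}, {(s, o) for s, _, o in gold})


gold = {("mug", "left_of", "plate"), ("bowl", "on", "plate")}
pred = {
    ("mug", "left_of", "plate"),   # correct
    ("bowl", "behind", "plate"),   # right pair, wrong relation
    ("plate", "left_of", "mug"),   # reversed direction, so a different triplet
}
print(score_triplets(pred, gold))
print(score_binding(pred, gold))
```

Separating the two scores distinguishes a model that picks the right object pair but the wrong relation from one that fails to bind the objects at all.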

Technical impact

May affect evaluation benchmarks for spatial grounding and model selection for manipulation pipelines that depend on it.

This panel is AI-generated and reviewed for accuracy.


[Submitted on 6 Jun 2026]

Title: EmbodimentSemantic: A Spatial Scene-Graph Dataset and Benchmark for Vision-Language Models on Embodied Manipulation Trajectories


Abstract: Spatial grounding remains a key limitation of vision-language-action (VLA) systems for robotic manipulation. While current models can recognize objects and follow language instructions, they often lack an explicit representation of how objects are arranged in space, including support, containment, ordering, occlusion, and depth-sensitive relations. We introduce EmbodimentSemantic, a spatial scene-graph dataset and benchmark for evaluating relational grounding in embodied manipulation. EmbodimentSemantic represents scenes as directed object-relation-object triplets, where each triplet specifies a spatial relation between an ordered pair of objects using a fixed set of relations. This representation enables direct evaluation of object binding, relation prediction, and spatial consistency. The dataset includes real-world manipulation observations collected with the low-cost SO101 robot arm, together with generated scene graphs for studying spatial grounding in practical robotic settings. To provide controlled validation, we also introduce a simulator-grounded LIBERO benchmark with over 60K manipulation frames and more than 120K camera-specific scene graphs across paired third-person and wrist views, where ground-truth relations are derived automatically from MuJoCo geometry, world coordinates, camera projections, and visibility constraints. We further test whether scene graphs improve downstream control by injecting them into existing VLA policy prompts. Experiments across open-source and commercial VLMs show that current models often predict plausible relations but struggle with exact depth-aware and viewpoint-dependent spatial structure. EmbodimentSemantic provides a unified framework for diagnosing spatial grounding in VLM perception and testing its utility for VLA manipulation.
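To see why the benchmark's scene graphs are camera-specific, consider a minimal sketch that derives two relations from world coordinates and a camera pose. It is not the paper's pipeline: it omits projection and visibility checks, does not call MuJoCo, and assumes a camera frame with x pointing right and z pointing forward. The relation names, the `margin` threshold, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Triplet = Tuple[str, str, str]


def world_to_camera(p: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Apply the rigid transform p_cam = R @ p_world + t."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def camera_relations(
    objects: Dict[str, Vec3], R: Mat3, t: Vec3, margin: float = 0.02
) -> List[Triplet]:
    """Derive directed triplets from object centers in one camera's frame.

    Assumed camera axes: +x is right in the image, +z is distance from the camera.
    """
    cam = {name: world_to_camera(p, R, t) for name, p in objects.items()}
    triplets: List[Triplet] = []
    for a in sorted(cam):
        for b in sorted(cam):
            if a == b:
                continue
            if cam[a][0] < cam[b][0] - margin:
                triplets.append((a, "left_of", b))
            if cam[a][2] < cam[b][2] - margin:
                triplets.append((a, "in_front_of", b))  # a is nearer the camera
    return triplets


objects = {"mug": (-0.2, 0.0, 1.0), "plate": (0.2, 0.0, 1.2)}

# Camera 1 sits at the world origin with axes aligned to the world frame.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
front_view = camera_relations(objects, identity, (0.0, 0.0, 0.0))

# Camera 2 sits at world z = 2.5 and looks back at the scene (180 degrees about y).
flipped = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]
opposite_view = camera_relations(objects, flipped, (0.0, 0.0, 2.5))

print(front_view)
print(opposite_view)
```

The same two world positions yield `mug left_of plate` from the first camera and `plate left_of mug` from the second, which is the kind of viewpoint dependence the abstract reports current models struggle with.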

Subjects: Robotics (cs.RO)

Cite as: arXiv:2607.00020 [cs.RO]

(or arXiv:2607.00020v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2607.00020

arXiv-issued DOI via DataCite

Submission history

From: Hassan Jaber
[v1] Sat, 6 Jun 2026 18:58:54 UTC (3,622 KB)
