2026-05-06 00:00 UTC | Updated: 2026-06-27 00:25 UTC

From Where Things Are to What They’re For: Benchmarking Spatial–Functional Intelligence for Multimodal LLMs

Apple ML Research introduces SFI-Bench, a video-based benchmark with over 1,700 questions to evaluate multimodal LLMs on spatial and functional reasoning. Experiments show current models struggle to integrate spatial memory with functional knowledge, highlighting a critical bottleneck.

Source: Apple Machine Learning Research

Research area: Computer Vision | Conference: CVPR

Content type: paper | Published: May 2026

Authors: Le Zhang†**, Jihan Yang‡, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal†, Bo-Hsiang Tseng

True spatial intelligence for multimodal agents transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks, such as VSI-Bench, effectively evaluate this foundational geometric stage, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1700 questions derived from diverse, egocentric indoor video scans. SFI-Bench is designed to systematically evaluate two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, inferring object affordances and context-dependent utility. Its tasks, including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenge a model’s ability to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to integrate spatial memory with functional and external knowledge, highlighting a critical bottleneck. SFI-Bench thus provides an essential tool for measuring and driving progress towards more cognitively capable and truly grounded multimodal agents.
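A benchmark that spans several task types is usually reported per task rather than as one pooled number, so that weakness on one kind of reasoning is not hidden by strength on another. The sketch below illustrates that kind of breakdown. It is not the SFI-Bench evaluation code: the `Record` schema, the task labels, the exact-match scoring rule, and the toy records are all assumptions made for illustration, and the abstract does not state the answer format or metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One evaluated question (hypothetical schema, not the SFI-Bench format)."""
    task: str        # assumed label, e.g. "conditional_counting"
    prediction: str  # the model's answer
    answer: str      # the reference answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(text.strip().lower().split())


def accuracy_by_task(records):
    """Return exact-match accuracy per task (assumed metric)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r.task] += 1
        if normalize(r.prediction) == normalize(r.answer):
            correct[r.task] += 1
    return {task: correct[task] / total[task] for task in total}


def macro_average(per_task):
    """Unweighted mean over tasks, so small task sets count as much as large ones."""
    return sum(per_task.values()) / len(per_task) if per_task else 0.0


# Made-up records for illustration only; these are not benchmark data.
records = [
    Record("conditional_counting", "3", "3"),
    Record("conditional_counting", "2", "4"),
    Record("functional_pairing", " Kettle ", "kettle"),
    Record("multi_hop_relational", "left of the sofa", "behind the sofa"),
]

per_task = accuracy_by_task(records)
overall = macro_average(per_task)
```

Macro-averaging is one possible design choice here; a question-weighted (micro) average would instead favor whichever task has the most questions.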

† Mila, Université de Montréal

‡ New York University

** Work done while at Apple