From Where Things Are to What They’re For: Benchmarking Spatial–Functional Intelligence for Multimodal LLMs
Apple ML Research introduces SFI-Bench, a video-based benchmark with over 1,700 questions to evaluate multimodal LLMs on spatial and functional reasoning. Experiments show current models struggle to integrate spatial memory with functional knowledge, highlighting a critical bottleneck.
Research area: Computer Vision | Conference: CVPR
Content type: Paper | Published: May 2026
Authors: Le Zhang†**, Jihan Yang‡, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal†, Bo-Hsiang Tseng
True spatial intelligence for multimodal agents transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks, such as VSI-Bench, effectively evaluate this foundational geometric stage, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1700 questions derived from diverse, egocentric indoor video scans. SFI-Bench is designed to systematically evaluate two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, inferring object affordances and context-dependent utility. Its tasks, including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenge a model’s ability to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to integrate spatial memory with functional and external knowledge, highlighting a critical bottleneck. SFI-Bench thus provides an essential tool for measuring and driving progress towards more cognitively capable and truly grounded multimodal agents.
† Mila, Université de Montréal
‡ New York University
** Work done while at Apple
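The two reasoning dimensions and four example tasks named in the abstract suggest reporting results per dimension rather than as a single score. The following is a minimal sketch of such an aggregation, not the authors' evaluation code. The abstract does not specify the answer format, the scoring metric, or which task belongs to which dimension, so the `Item` record, the `TASK_DIMENSION` mapping, and the exact-match scoring are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Task names are taken from the abstract; assigning each task to a
# dimension is an assumption, not something the abstract states.
TASK_DIMENSION = {
    "conditional_counting": "structured_spatial",
    "multi_hop_relational": "structured_spatial",
    "functional_pairing": "functional",
    "knowledge_grounded_troubleshooting": "functional",
}


@dataclass
class Item:
    """Hypothetical record for one benchmark question and a model's answer."""
    task: str
    gold: str
    prediction: str


def accuracy_by_dimension(items):
    """Return exact-match accuracy per reasoning dimension.

    Exact match after trimming and lowercasing is an assumed metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        dim = TASK_DIMENSION[item.task]
        total[dim] += 1
        if item.prediction.strip().lower() == item.gold.strip().lower():
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```

Splitting the score this way would show whether a model's errors concentrate in spatial or in functional questions, which is the kind of gap the abstract reports.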
Related readings and updates.
Does Spatial Cognition Emerge in Frontier Models?
March 5, 2025 | Research areas: Computer Vision, Speech and Natural Language Processing | Conference: ICLR
Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate…
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model
February 12, 2025 | Research areas: Human-Computer Interaction, Speech and Natural Language Processing | Conference: ICASSP
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, which is a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial…