
From Where Things Are to What They’re For: Benchmarking Spatial–Functional Intelligence for Multimodal LLMs

Apple ML Research introduces SFI-Bench, a video-based benchmark with over 1,700 questions to evaluate multimodal LLMs on spatial and functional reasoning. Experiments show current models struggle to integrate spatial memory with functional knowledge, highlighting a critical bottleneck.


Research area: Computer Vision; conference: CVPR

Content type: paper; published: May 2026


Authors: Le Zhang†**, Jihan Yang‡, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal†, Bo-Hsiang Tseng


True spatial intelligence for multimodal agents transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks, such as VSI-Bench, effectively evaluate this foundational geometric stage, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1700 questions derived from diverse, egocentric indoor video scans. SFI-Bench is designed to systematically evaluate two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, inferring object affordances and context-dependent utility. Its tasks, including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenge a model’s ability to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to integrate spatial memory with functional and external knowledge, highlighting a critical bottleneck. SFI-Bench thus provides an essential tool for measuring and driving progress towards more cognitively capable and truly grounded multimodal agents.

† Mila, Université de Montréal

‡ New York University

** Work done while at Apple
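
As a rough illustration of how results on a benchmark organized along the two dimensions named in the abstract might be aggregated, the sketch below groups per-question outcomes by task and reports accuracy per dimension. Everything in it is an assumption made for illustration: the `Result` record, the task identifiers, the assignment of each task to a dimension, and the case-insensitive exact-match scoring are hypothetical and are not taken from SFI-Bench's actual data format or evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical assignment of the four task types listed in the abstract
# to the two reasoning dimensions. The paper may group them differently.
TASK_TO_DIMENSION = {
    "conditional_counting": "structured_spatial",
    "multi_hop_relational": "structured_spatial",
    "functional_pairing": "functional",
    "knowledge_grounded_troubleshooting": "functional",
}


@dataclass(frozen=True)
class Result:
    """One model answer for one benchmark question (hypothetical schema)."""
    task: str
    predicted: str
    answer: str


def accuracy_by_dimension(results):
    """Return {dimension: accuracy} using case-insensitive exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        dim = TASK_TO_DIMENSION[r.task]
        total[dim] += 1
        if r.predicted.strip().lower() == r.answer.strip().lower():
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy, made-up answers purely to exercise the function.
results = [
    Result("conditional_counting", "3", "3"),
    Result("multi_hop_relational", "B", "A"),
    Result("functional_pairing", "kettle", "Kettle"),
    Result("knowledge_grounded_troubleshooting", "C", "C"),
]
scores = accuracy_by_dimension(results)
print(scores)
```

Reporting the two dimensions separately, rather than a single pooled score, is one way to surface the kind of gap the abstract describes between spatial memory and functional knowledge.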

