2026-07-02 04:00 UTC · Updated: 2026-07-02 08:19 UTC

PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

This paper presents PixelEyes, a multi-turn visual reasoning agent that decouples reasoning from perception to address the repeated localization failures of multimodal LLMs. It introduces mask-guided visual search and semantic-region breadth-first search, constructs the PixelEyes-6K dataset and the Pinpoint-Bench benchmark, and shows that existing models leave large headroom on that benchmark.

Source: arXiv Computer Vision

Authors: Dengxian Gong, Yuanzheng Wu, Haobo Yuan, Zhengdong Hu, Tao Zhang, Yikang Zhou, Shihao Chen, Quanzhu Niu, Kai Wang, Jason Li, Haochen Wang, Lu Qi, Shunping Ji, Ming-Hsuan Yang


[Submitted on 30 Jun 2026]



Abstract: This paper explores multi-turn visual reasoning and observes that MLLMs repeatedly fail to localize the target, leading to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model: the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory. To solve this problem, we propose PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception, i.e., the reasoner decides what to look for, while a specialized perception tool answers where it is. Specifically, PixelEyes introduces 1) Mask-guided Visual Search: a referring segmentation model is invoked to provide mask-precise localization, freeing the reasoner from the need to compensate for imprecise grounding; and 2) Semantic-region Breadth-first Search (BFS): to eliminate redundant loops caused by repeatedly cropping incorrect sub-regions, we organize exploration as a breadth-first search over semantic regions. To internalize these capabilities, we construct the PixelEyes-6K dataset by resynthesizing expert trajectories from existing data, which explicitly embeds our mask-guided search and BFS logic into the model. We further introduce Pinpoint-Bench, a zero-hint visual search benchmark, i.e., no location cues are provided in the question, with instance-level masks and bounding boxes that separate localization failures from reasoning failures, enabling fine-grained analysis of failure modes such as inattentional blindness. Recent state-of-the-art MLLMs and visual reasoning agents leave large headroom on Pinpoint-Bench, demonstrating its quality and difficulty. Code and models are open-sourced.
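The abstract does not spell out how the breadth-first search over semantic regions is implemented. As a rough illustration only, the sketch below shows one way such a search could be organized when the perception step is a separate callable. Everything in it is an assumption made for illustration and is not taken from the paper: the `Region` tree, the `bfs_search` function, the `toy_locate` stand-in for a referring segmentation model, and the use of a bounding box in place of a mask.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); a simplification of a mask


@dataclass
class Region:
    """Hypothetical semantic region: a name, its box, and nested sub-regions."""
    name: str
    box: Box
    children: List["Region"] = field(default_factory=list)


# Hypothetical perception tool: answers "where is `target` inside `region`?"
Locator = Callable[[Region, str], Optional[Box]]


def bfs_search(root: Region, target: str, locate: Locator,
               max_calls: int = 50) -> Tuple[Optional[Box], List[str]]:
    """Visit regions level by level, calling the perception tool once per region.

    Returns the first hit (or None) and the order in which regions were visited.
    Each region in the tree is dequeued at most once, so no region is re-cropped.
    """
    queue = deque([root])
    visited: List[str] = []
    while queue and len(visited) < max_calls:
        region = queue.popleft()
        visited.append(region.name)
        hit = locate(region, target)
        if hit is not None:
            return hit, visited
        queue.extend(region.children)  # explore sub-regions only after siblings
    return None, visited


# Toy scene and a stand-in locator (assumption: the target is only visible
# once the search has narrowed down to the "top shelf" region).
scene = Region("scene", (0, 0, 100, 100), [
    Region("desk", (0, 50, 60, 100), [Region("drawer", (10, 70, 40, 95))]),
    Region("shelf", (60, 0, 100, 60), [Region("top shelf", (60, 0, 100, 20))]),
])


def toy_locate(region: Region, target: str) -> Optional[Box]:
    if target == "red mug" and region.name == "top shelf":
        return (70, 5, 80, 15)
    return None


box, order = bfs_search(scene, "red mug", toy_locate)
print(box)    # (70, 5, 80, 15)
print(order)  # ['scene', 'desk', 'shelf', 'drawer', 'top shelf']
```

In this sketch the caller (the "reasoner") only supplies the target phrase, while `locate` (the "perception tool") answers where it is, which mirrors the division of labor the abstract describes. The `max_calls` budget is an added safeguard, not something the abstract mentions.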

Comments: 22 pages, 10 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2607.00115 [cs.CV]

(or arXiv:2607.00115v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2607.00115

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Dengxian Gong

[v1] Tue, 30 Jun 2026 19:51:54 UTC (35,369 KB)
