2026-06-19 · Original · 2 min read · Updated: 2026-06-19

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

This paper proposes PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Leveraging the parallel decoding nature of diffusion language models, it introduces efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, significantly improving inference efficiency. A new benchmark, ParaDLC-Bench, is constructed to evaluate parallelism in visual perception. Experiments show competitive performance with substantial speed improvements for multi-region tasks.
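To make the efficiency argument concrete, here is a minimal sketch of token-level parallel decoding in a masked diffusion language model. It is not the paper's decoding schedule: confidence-based unmasking is a common choice for this family of models and is assumed here for illustration, and `parallel_unmask`, `toy_predict`, and `tokens_per_step` are hypothetical names. The point is only that committing several tokens per denoising step needs fewer model calls than one-token-at-a-time autoregressive generation.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marker for a position that has not been decoded yet

# A predictor maps the current partially masked sequence to one
# (token, confidence) proposal per position. Hypothetical interface.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def parallel_unmask(
    length: int, predict: Predictor, tokens_per_step: int
) -> Tuple[List[Optional[str]], int]:
    """Decode `length` tokens, committing up to `tokens_per_step` per step."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        proposals = predict(seq)  # one model call per denoising step
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Commit the most confident masked positions first.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            seq[i] = proposals[i][0]
        steps += 1
    return seq, steps


def toy_predict(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Stand-in for a real model: fixed tokens with made-up confidences."""
    target = ["a", "red", "car", "near", "a", "tree"]
    return [(tok, 1.0 / (i + 1)) for i, tok in enumerate(target)]


if __name__ == "__main__":
    caption, steps = parallel_unmask(6, toy_predict, tokens_per_step=3)
    print(" ".join(caption), "| model calls:", steps)
```

With six tokens and three commits per step, the loop finishes in two model calls, where an autoregressive decoder would need six. A real model trades some of that speed for quality, since tokens committed in the same step are predicted without seeing each other.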

Source: arXiv Computer Vision

Authors: Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong


[Submitted on 17 Jun 2026]


Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of diffusion language models (DLMs). Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate how well DLMs parallelize visual perception, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by extending DLC-Bench to include multiple region masks per image, enabling joint evaluation of caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region captioning and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
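The abstract does not spell out the structured attention mask, so the following is only one plausible reading, sketched under stated assumptions: the sequence is laid out as shared image and prompt tokens followed by one caption segment per region, every token may attend to the shared tokens, and caption tokens otherwise attend only within their own region. The function name `build_region_attention_mask` and the layout are hypothetical, not taken from the paper or its code.

```python
from typing import List


def build_region_attention_mask(
    num_shared: int, region_lengths: List[int]
) -> List[List[bool]]:
    """Return mask[q][k] = True when query token q may attend to key token k.

    Assumed layout: [shared tokens | region 0 caption | region 1 caption | ...].
    """
    # Segment id per token: -1 for shared tokens, else the region index.
    segment = [-1] * num_shared
    for idx, length in enumerate(region_lengths):
        segment.extend([idx] * length)

    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if segment[k] == -1:
                # Every token sees the shared image and prompt tokens.
                mask[q][k] = True
            elif segment[q] == segment[k]:
                # Bidirectional attention inside a single region's caption.
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Two shared tokens, then captions of length 2 and 1.
    for row in build_region_attention_mask(2, [2, 1]):
        print("".join("1" if allowed else "." for allowed in row))
```

Because no caption segment can see another, the captions for all regions can share one forward pass without leaking into each other, which is the property that would let multiple regions be described in a single batch of denoising steps rather than one region at a time.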

Comments: Code available at this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2606.19534 [cs.CV]

(or arXiv:2606.19534v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.19534

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yueyi Sun
[v1] Wed, 17 Jun 2026 19:27:55 UTC (12,947 KB)
