2026-06-19 | Original | 2 min read | Updated: 2026-06-19

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

The paper presents DiffusionVS, a diffusion-based visual servoing method that uses conditional denoising to generate camera velocity and online training for improved generalization. It achieves nearly 100% success in simulation and 93% in physical experiments, and can be integrated into existing visual servoing networks to boost performance.

Source: arXiv Robotics | Authors: Hongkang Cui, Rui He, Haoyao Chen

Key points

Visual servoing is fundamental to robotic manipulation and navigation, but regression-based methods are prone to trajectory jitter from noise-sensitive single-step mappings and to error accumulation under distribution shift.
DiffusionVS takes the normalized image coordinates of observed tag corners as input and generates camera velocity via conditional denoising.
An online training paradigm continuously expands training data diversity through interactive experience collection, improving generalization.
Success rates reach nearly 100% in simulation and 93% in physical experiments; the diffusion module also improves existing visual servoing networks when integrated into them.
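
To make the input representation in the key points concrete, the following minimal sketch converts the pixel coordinates of four tag corners into normalized image coordinates under an ideal pinhole camera model. The function name `normalize_corners`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the sample corner values are assumptions made for illustration; the paper's own preprocessing (for example, any lens distortion handling) may differ.

```python
from typing import List, Sequence, Tuple


def normalize_corners(
    pixels: Sequence[Tuple[float, float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[float]:
    """Map pixel corners (u, v) to normalized image coordinates (x, y).

    Assumes an ideal pinhole camera with no lens distortion:
        x = (u - cx) / fx,  y = (v - cy) / fy
    Returns a flat feature vector [x1, y1, x2, y2, ...].
    """
    features: List[float] = []
    for u, v in pixels:
        features.append((u - cx) / fx)
        features.append((v - cy) / fy)
    return features


# Hypothetical intrinsics and tag corners, chosen only for illustration.
corners_px = [(320.0, 240.0), (420.0, 240.0), (420.0, 340.0), (320.0, 340.0)]
obs = normalize_corners(corners_px, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(obs)  # eight values: one (x, y) pair per corner
```

Normalized coordinates remove the dependence on focal length and principal point, so a policy conditioned on them is not tied to one specific camera's pixel grid.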

Why it matters

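Visual servoing underpins robotic manipulation and navigation, yet regression-based controllers are prone to trajectory jitter and to errors that accumulate under distribution shift. A diffusion-based policy that predicts action sequences targets both weaknesses.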

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

[Submitted on 17 Jun 2026]

Abstract: Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation.

This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
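
The conditional denoising step described in the abstract can be pictured with a standard DDPM-style reverse loop. The sketch below is not the authors' implementation: it samples a single 6-D velocity rather than an action sequence, uses a linear noise schedule chosen for illustration, and replaces the trained network with a hypothetical oracle (`make_toy_predictor`) built around an arbitrary stand-in target (`toy_target`) so that the loop is runnable. All function names and numeric values are assumptions.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
NoisePredictor = Callable[[Vector, int, Sequence[float]], Vector]


def linear_beta_schedule(steps: int, start: float = 1e-4, end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances (an assumed schedule)."""
    if steps == 1:
        return [start]
    delta = (end - start) / (steps - 1)
    return [start + i * delta for i in range(steps)]


def cumulative_alphas(betas: Sequence[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def sample_velocity(predict_noise: NoisePredictor, cond: Sequence[float],
                    betas: Sequence[float], rng: random.Random, dim: int = 6) -> Vector:
    """DDPM reverse process: start from Gaussian noise, denoise step by step."""
    alpha_bars = cumulative_alphas(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t, cond)  # conditioned on the observation
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        x = [scale * (xi - coef * ei) for xi, ei in zip(x, eps)]
        if t > 0:  # no noise is injected on the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def toy_target(cond: Sequence[float]) -> Vector:
    """Arbitrary stand-in for the 'true' velocity; NOT a control law."""
    return [-0.5 * c for c in cond[:6]]


def make_toy_predictor(betas: Sequence[float]) -> NoisePredictor:
    """Hypothetical oracle replacing the trained noise-prediction network."""
    alpha_bars = cumulative_alphas(betas)

    def predict(x_t: Vector, t: int, cond: Sequence[float]) -> Vector:
        a = alpha_bars[t]
        x0 = toy_target(cond)
        return [(xi - math.sqrt(a) * x0i) / math.sqrt(1.0 - a)
                for xi, x0i in zip(x_t, x0)]

    return predict


betas = linear_beta_schedule(50)
cond = [0.0, 0.0, 0.2, 0.0, 0.2, 0.2, 0.0, 0.2]  # normalized corner features
velocity = sample_velocity(make_toy_predictor(betas), cond, betas, random.Random(0))
print(velocity)
```

With a perfect noise predictor the loop recovers the target regardless of the initial noise sample; a trained network only approximates this, which is why the number of denoising steps and the schedule affect both accuracy and inference cost.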

Comments: 8 pages, 4 figures, 7 tables

Subjects: Robotics (cs.RO)

Cite as: arXiv:2606.19397 [cs.RO]

(or arXiv:2606.19397v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2606.19397

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Hongkang Cui
[v1] Wed, 17 Jun 2026 08:06:05 UTC (2,709 KB)
