2026-06-25 04:00 UTC | Updated: 2026-06-25 07:59 UTC

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety

Yuvion VL is a family of multimodal large language models purpose-built for content and AI safety, treating safety as an inherently adversarial and multimodal problem. It combines an automated data pipeline (adversarial-aware synthesis with multi-stage quality control), a three-stage training pipeline (continued pretraining for risk-concept cross-modal alignment, instruction post-training, and reasoning post-training), and Confuse-then-Contrast Fine-Tuning, a contrastive framework for telling apart visually similar cases with different safety implications. The Yuvion VL RiskEval (YVRE) benchmark collection evaluates content and AI safety, adversarial robustness, and real-world capabilities. The authors report that Yuvion VL-32B achieves industry-leading safety performance, surpassing comparably sized open-source models and the best closed-source commercial models while maintaining comparable general capabilities.

Source: arXiv Computer Vision

Authors: Shikai Qiu, Xiaowen Xu, Benlei Cui, Ting Ma, Xiufeng Huang, Wenjing Jiang, Shaoxuan He, Haolei Xu, Chunyang Chai, Yujian Li, Yiliang Zhang, Guanghui Wang, Ziheng Wang, Ziwen Xu, Zhaoyu Fan, Jinhao Chen, Ruijie Jian, Hongxing Li, Chuxi Xiao, Xinyue Chen, Wenxuan Liu, Libin Dong, Yupeng Cao, Xiaoqian Xia, Jing Wang, Zhe Jiang, Zhenan Ye, Guang Yang, Bin Liu, Wei Peng, Ziqiang Zhu, Meihui Lian, Kaiwen Lv Kacuila, Haidong Ding, Dongjie Zhang, Yangfan Zhou, Bingyu Zhu, Yan Wang, Hai Zhao, Xuan Jin, Wei Zhao, Pengfei Sun, Huiming Zhang, Wei Wang, Xipeng Cao, Bin Li, Chengwen Yao, Meng Huang, Xianfeng Li, Bin Tang, Chao Liu, Hui Xue, Longtao Huang, Haiwen Hong

[Submitted on 23 Jun 2026]

Abstract: General-purpose models often struggle to reliably identify and understand real-world multimodal risks, largely because content and AI safety is inherently multimodal and adversarial. We present Yuvion VL, a family of multimodal large language models purpose-built for content and AI safety, with both instruction-tuned and reasoning-oriented variants. Yuvion VL addresses this gap by treating safety as an inherently adversarial and multimodal problem and designing the entire pipeline around adversarial robustness. For data construction, we develop an automated pipeline integrating adversarial-aware data synthesis with multi-stage quality control, producing large-scale, high-quality multimodal samples augmented with domain knowledge and reasoning annotations. For training, we adopt a three-stage pipeline that includes continued pretraining for risk-concept cross-modal alignment, instruction post-training for production-grade safety tasks, and reasoning post-training for enhanced interpretability and performance in complex tasks. We further introduce Confuse-then-Contrast Fine-Tuning, a contrastive framework that mines model-specific confusions and constructs multi-image contrastive groups to enforce explicit discrimination of fine-grained visual-semantic elements, enabling the model to distinguish between visually similar cases with different safety implications in adversarial safety tasks. To support rigorous evaluation, we also introduce Yuvion VL RiskEval (YVRE), a collection of benchmarks covering diverse open and internal evaluations, with a focus on content and AI safety, adversarial robustness, and real-world capability requirements. Experiments show that Yuvion VL-32B achieves industry-leading safety performance, surpassing comparably sized open-source models and the best closed-source commercial models, while maintaining comparable general capabilities.
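The abstract describes Confuse-then-Contrast Fine-Tuning only at a high level: mine the confusions specific to a model, then build multi-image contrastive groups from them. The sketch below is one possible reading of those two steps, not the authors' implementation. The record fields (`image_id`, `label`, `pred`), the label names, and the functions `mine_confusions` and `build_contrast_groups` are all hypothetical, and the grouping rule (a few images per label from each frequently confused label pair) is an assumption made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical evaluation records: ground-truth label vs. the model's prediction.
records = [
    {"image_id": "img1", "label": "toy_gun", "pred": "real_gun"},
    {"image_id": "img2", "label": "toy_gun", "pred": "real_gun"},
    {"image_id": "img3", "label": "real_gun", "pred": "real_gun"},
    {"image_id": "img4", "label": "real_gun", "pred": "toy_gun"},
    {"image_id": "img5", "label": "benign", "pred": "benign"},
    {"image_id": "img6", "label": "benign", "pred": "toy_gun"},
]


def mine_confusions(records, top_k=1):
    """Return the top_k most frequent unordered label pairs the model mixes up."""
    counts = Counter(
        tuple(sorted((r["label"], r["pred"])))
        for r in records
        if r["label"] != r["pred"]
    )
    return [pair for pair, _ in counts.most_common(top_k)]


def build_contrast_groups(records, confused_pairs, per_label=2):
    """Assumed grouping rule: for each confused pair, take up to per_label
    images of each label so the two classes appear side by side."""
    by_label = defaultdict(list)
    for r in records:
        by_label[r["label"]].append(r["image_id"])
    groups = []
    for a, b in confused_pairs:
        images = [(img, a) for img in by_label[a][:per_label]]
        images += [(img, b) for img in by_label[b][:per_label]]
        groups.append({"labels": (a, b), "images": images})
    return groups


confused = mine_confusions(records, top_k=1)
groups = build_contrast_groups(records, confused)
print(confused)
print(groups)
```

In this toy data the model confuses `toy_gun` and `real_gun` three times and `benign` and `toy_gun` once, so only the first pair is turned into a group. How the paper selects images, sizes the groups, or defines the training objective over them is not stated in the abstract.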

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.25034 [cs.CV]

(or arXiv:2606.25034v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.25034

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Benlei Cui

[v1] Tue, 23 Jun 2026 18:00:08 UTC (15,106 KB)
