2026-06-29 04:00 UTC | Updated: 2026-06-29 08:12 UTC

Aloe-Vision: Robust Vision-Language Models for Healthcare

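Aloe-Vision introduces a family of open-source medical vision-language models trained on a large-scale, quality-filtered dataset. The models achieve balanced performance across medical and general tasks, and the accompanying evaluation shows that current vision-language models remain vulnerable to adversarial inputs.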

Source: arXiv Computer Vision

Authors: Jaume Guasch-Martí, Enrique Lopez-Cuena, Martín Suárez-Fernández, Jordi Bayarri-Planas, Anna Arias-Duart, Dario Garcia-Gasulla

[Submitted on 25 Jun 2026]

Abstract: Large Vision-Language Models (LVLMs) specialized in healthcare are emerging as a promising research direction due to their potential impact in clinical and biomedical applications. However, progress is constrained by the scarcity of high-quality medical multimodal data, concerns about robustness in safety-critical settings, and the narrow and potentially contaminated evaluation benchmarks that limit reliable assessment. To address these issues, the field requires state-of-the-art solutions to be fully open and reproducible systems in which all components can be inspected, evaluated, and improved. This work introduces Aloe-Vision-Data, a large-scale, quality-filtered mixture which integrates both medical and general domains across multimodal and text-only sources, designed for direct use in model fine-tuning. Building on this dataset, we train the Aloe-Vision family of medical LVLMs, openly released with full weights, training recipes and data, in two scales (7B and 72B). Through comprehensive benchmarking, we demonstrate that high-quality training mixtures produce balanced LVLMs which yield significant gains over the baseline models without compromising general capabilities, achieving competitive performance with respect to state-of-the-art alternatives. To support reliable evaluation, we introduce CareQA-Vision, a carefully curated vision benchmark derived from MIR and EIR exams, the residency entrance exams for medical and nursing specialists in Spain, offering novel vision questions with low likelihood of contamination. Finally, we show that current LVLMs remain vulnerable to adversarial and misleading inputs, underscoring reliability challenges in clinical contexts.

Comments: MIDL 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Cite as: arXiv:2606.27500 [cs.CV]

(or arXiv:2606.27500v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.27500

arXiv-issued DOI via DataCite (pending registration)

Journal reference: Proceedings of Machine Learning Research, Vol. 315, pp. 2404-2426, 2026

Submission history

From: Jaume Guasch-Martí

[v1] Thu, 25 Jun 2026 19:36:38 UTC (7,959 KB)
