2026-06-03 04:00 UTCOriginal source2 min readUpdated: 2026-06-30 13:03 UTC

Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

This paper presents an automated agent-driven pipeline that generates multiple-choice VQA datasets from paired private radiology reports and 3D oncology imaging, producing two complementary question types: RADS-style and report-derived questions. Evaluated on four in-house cancer cohorts, zero-shot evaluation reveals no dominant model and substantial headroom. A blind ablation shows visual reliance is dataset-specific; lung CT is solvable without images. The pipeline is released as an open agent skill.

SourcearXiv Computer VisionAuthor: Bo Liu, Hanxue Gu, Xiangru Li, Zheren Zhu, Jacob Ellison, Kang Wang, Janine M. Lupo, Yang Yang, Hui Lin

[2606.02809] Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

[Submitted on 1 Jun 2026]

Title:Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

View a PDF of the paper titled Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging, by Bo Liu and 8 other authors

View PDF HTML (experimental)

Abstract:Evaluating vision-language models (VLMs) on medical images requires benchmarks that are clinically grounded, scalable, and controlled for evaluation confounds. Existing public benchmarks are limited in scale, manually annotated, or potentially leaked into VLM pretraining corpora. We present an automated agent-driven pipeline that generates multiple-choice VQA datasets directly from paired private radiology reports and 3D oncology imaging, producing two complementary question types: RADS-style questions deterministically derived from clinician-defined reporting schemas, and radiology report-derived questions generated by an LLM from radiologist findings and verified against the source report. Applied to four in-house cancer cohorts, the pipeline yields an instance-contamination-controlled benchmark without per-question human annotation. Zero-shot evaluation of six VLMs reveals no dominant model and substantial headroom across all cells. A blind ablation reveals that visual reliance is highly dataset-specific: liver Report-derived questions genuinely require the image, while Lung CT is essentially solvable without it - the leading closed model exceeds its sighted accuracy on Lung CT when blinded - indicating that even private clinical data does not guarantee a contamination-controlled read of visual capability. The pipeline is released as an open agent skill for in-house redeployment.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2606.02809 [cs.CV]

(or arXiv:2606.02809v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.02809

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Bo Liu [view email] [v1] Mon, 1 Jun 2026 19:27:42 UTC (2,836 KB)

Full-text links:

Access Paper:

View a PDF of the paper titled Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging, by Bo Liu and 8 other authors

View PDF

HTML (experimental)

TeX Source

view license

Current browse context:

cs.CV

new | recent | 2026-06

Change to browse by:

References & Citations

NASA ADS

Google Scholar

Semantic Scholar

Data provided by:

Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)