2026-06-04 04:00 UTC (updated 2026-06-30 13:03 UTC)

When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection

EVID-Bench is a new benchmark for search-grounded video misinformation detection. It comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. The best of the nine evaluated models reaches only 61.43% point-level and 43.24% video-level accuracy, and AI-generated manipulations are especially challenging.

Source: arXiv Computer Vision
Authors: Tao Yu, Yujia Yang, Shenghua Chai, Zhang Jinshuai, Haopeng Jin, Hao Wang, Minghui Zhang, Zhongtian Luo, Yuchen Long, Xinlong Chen, Jiabing Yang, Zhaolu Kang, Yuxuan Zhou, Zhengyu Man, Xinming Wang, Hongzhu Yi, Zheqi He, Xi Yang, Yan Huang, Liang Wang

[Submitted on 2 Jun 2026]

Abstract: Video misinformation increasingly operates at the semantic and evidential level: authentic footage may be selectively edited, temporally reordered, spliced across sources, or augmented with AI-generated content to construct false narratives. Such evidence-dependent manipulations cannot be reliably verified from the input video alone, because the missing, reordered, replaced, or recontextualized evidence lies outside the video itself. We introduce EVID-Bench, a benchmark for search-grounded video misinformation detection, where a system must search the open web for related videos and identify what information is false through cross-video comparison. EVID-Bench comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. All samples are verified to be undetectable by frontier models through visual inspection alone. We evaluate nine frontier multimodal models using a retrieval-augmented verification baseline. The best system achieves only 61.43% point-level accuracy and 43.24% video-level accuracy, while AI-generated manipulations remain especially challenging. Error analysis reveals recurring challenges: models fixate on irrelevant anchors, misattribute synthetic content to editorial splicing, and terminate search prematurely before fully explaining the manipulation.
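The gap between the two reported accuracies is easier to read with a small sketch. The abstract does not define either metric, so the code below rests on an assumed reading: each video carries several annotated manipulation points, point-level accuracy pools all points across videos, and video-level accuracy counts a video as correct only when every one of its points is judged correctly. The function names and the toy data are hypothetical and are not taken from EVID-Bench.

```python
from typing import Dict, List


def point_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of all annotated points judged correctly, pooled over videos."""
    total = sum(len(points) for points in results.values())
    correct = sum(sum(points) for points in results.values())
    return correct / total if total else 0.0


def video_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of videos in which every annotated point is judged correctly.

    Assumption: a video counts as solved only if all of its points are right.
    """
    if not results:
        return 0.0
    solved = sum(1 for points in results.values() if points and all(points))
    return solved / len(results)


# Toy data, invented for illustration only (not EVID-Bench results).
toy_results = {
    "video_a": [True, True, True],
    "video_b": [True, False],
    "video_c": [True, True, False, True],
}

print(f"point-level: {point_level_accuracy(toy_results):.2%}")
print(f"video-level: {video_level_accuracy(toy_results):.2%}")
```

Under this reading, a single missed point makes the whole video count as wrong, which would be one way for video-level accuracy to sit well below point-level accuracy, as it does in the reported 61.43% versus 43.24%.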

Comments: 52 pages

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2606.04098 [cs.CV]

(or arXiv:2606.04098v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.04098

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Tao Yu
[v1] Tue, 2 Jun 2026 18:03:35 UTC (32,252 KB)
