2026-06-02原文2 min readUpdated: 2026-06-02

Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

arXiv:2606.00098v1 Announce Type: new Abstract: We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.

SourcearXiv Computer VisionAuthor: Izaldein Al-Zyoud, Abdulmotaleb El Saddik

[2606.00098] Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

[Submitted on 25 May 2026]

Title:Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

View a PDF of the paper titled Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection, by Izaldein Al-Zyoud and Abdulmotaleb El Saddik

View PDF HTML (experimental)

Abstract:We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Cite as: arXiv:2606.00098 [cs.CV]

(or arXiv:2606.00098v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.00098

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Izaldein Al-Zyoud [view email] [v1] Mon, 25 May 2026 17:07:00 UTC (1,534 KB)

Full-text links:

Access Paper:

View a PDF of the paper titled Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection, by Izaldein Al-Zyoud and Abdulmotaleb El Saddik

View PDF

HTML (experimental)

TeX Source

view license

Current browse context:

cs.CV

new | recent | 2026-06

Change to browse by:

cs eess eess.IV

References & Citations

NASA ADS

Google Scholar

Semantic Scholar

Data provided by:

Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)