2026-06-05 04:00 UTC | Updated: 2026-06-30 13:03 UTC

Personal AI Agent for Camera Roll VQA

This paper introduces the personal camera roll visual question answering task, constructs the camroll dataset with 50 users, 31,476 images, and 2,500 QA pairs, and designs camroll-agent, a conversational AI agent with hierarchical memory and efficient tools. Experiments show it outperforms baselines, highlighting the need for specialized approaches to personalized visual memory.

Source: arXiv Computer Vision
Authors: Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li


[Submitted on 3 Jun 2026]


Abstract: We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the scale of a personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate it and locate the correct or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing methods for long-context understanding in AI agent systems. Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are involved.
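The abstract does not spell out how the hierarchical memory or the tools are built. As a rough illustration of the general idea only, the sketch below groups photos by year and month and exposes three small navigation calls. Everything in it is an assumption made for illustration: the `Photo` and `HierarchicalMemory` names, the year/month layout, the `overview`, `open_month`, and `search` tools, and the use of plain-text captions (imagined to come from some upstream captioning step) are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Photo:
    path: str
    taken: date
    caption: str  # assumed: text produced by some upstream captioning step


class HierarchicalMemory:
    """Toy (year, month) -> photos index. A hypothetical layout, not the paper's."""

    def __init__(self, photos: Iterable[Photo]) -> None:
        index: Dict[Tuple[int, int], List[Photo]] = defaultdict(list)
        for photo in sorted(photos, key=lambda p: p.taken):
            index[(photo.taken.year, photo.taken.month)].append(photo)
        # Freeze into a plain dict so later lookups never create empty buckets.
        self._months: Dict[Tuple[int, int], List[Photo]] = dict(index)

    def overview(self) -> Dict[Tuple[int, int], int]:
        """Coarse level: photo count per (year, month), cheap to show an agent."""
        return {key: len(items) for key, items in self._months.items()}

    def open_month(self, year: int, month: int) -> List[Photo]:
        """Fine level: all photos in one month, or an empty list if none exist."""
        return list(self._months.get((year, month), []))

    def search(self, keyword: str, year: Optional[int] = None) -> List[Photo]:
        """Case-insensitive caption substring match, optionally limited to a year."""
        needle = keyword.lower()
        hits: List[Photo] = []
        for (y, _month), items in sorted(self._months.items()):
            if year is not None and y != year:
                continue
            hits.extend(p for p in items if needle in p.caption.lower())
        return hits
```

In a design like this, an agent would call `overview()` first to see which periods hold photos, then narrow down with `open_month` or `search` instead of loading every image into context at once. A real system would likely replace the substring match with visual or embedding-based retrieval.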

Comments: Project page, code, and demo: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.05275 [cs.CV]

(or arXiv:2606.05275v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.05275

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Thao Nguyen
[v1] Wed, 3 Jun 2026 17:59:30 UTC (1,544 KB)
