Hugging Face Daily Papers · June 5, 2026 · 7 min read

Personal AI Agent for Camera Roll VQA

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the size of a personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate it and locate the correct or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing methods for long-context understanding in AI agent systems. Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are involved.
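The abstract names hierarchical memory and a minimal tool set but does not describe how either is built. The sketch below is purely illustrative and is not the authors' implementation: it assumes a two-level index (month buckets, then photos) and assumes each photo already has a text caption. Every name in it (`Photo`, `HierarchicalMemory`, `list_months`, `search_month`) is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Photo:
    photo_id: str
    taken_on: date
    caption: str  # assumed: a text description produced elsewhere


class HierarchicalMemory:
    """Toy two-level index: (year, month) buckets, then photos."""

    def __init__(self, photos):
        self._by_month = defaultdict(list)
        for photo in photos:
            key = (photo.taken_on.year, photo.taken_on.month)
            self._by_month[key].append(photo)

    def list_months(self):
        """Hypothetical tool 1: coarse overview of photo counts per month."""
        return {key: len(items) for key, items in sorted(self._by_month.items())}

    def search_month(self, year, month, keyword):
        """Hypothetical tool 2: keyword lookup inside a single month."""
        needle = keyword.lower()
        bucket = self._by_month.get((year, month), [])
        return [p for p in bucket if needle in p.caption.lower()]


# Made-up example data, not taken from the camroll dataset.
photos = [
    Photo("img_001", date(2025, 3, 14), "bowl of pho at a street stall"),
    Photo("img_002", date(2025, 3, 20), "birthday cake with candles"),
    Photo("img_003", date(2026, 1, 2), "ramen with soft-boiled egg"),
]
memory = HierarchicalMemory(photos)
overview = memory.list_months()           # coarse step: which months hold photos
hits = memory.search_month(2025, 3, "pho")  # fine step: look inside one month
```

The point of the two levels is that an agent can first inspect the cheap overview and only then open one bucket, rather than scanning every photo for every question. A real system would presumably need visual features rather than caption keywords, which is the gap the abstract highlights.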

arxiv:2606.05275

Published on Jun 3 · Submitted by Thao Nguyen on Jun 5

Authors: Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

Abstract

A conversational AI agent is developed for personal camera roll visual question answering, featuring hierarchical memory and specialized tools for navigating large visual datasets with personalized content.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Community

thaoshibe

Paper submitter about 9 hours ago

"if an AI could see your whole camera roll, what would you ask?"
Project page: https://thaoshibe.github.io/camroll/
GitHub: https://github.com/thaoshibe/camroll/

Get this paper in your agent:

hf papers read 2606.05275

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
