Hugging Face Daily Papers · 5 min read

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.14777

Published on Jun 10 · Submitted by steven young on Jun 16 · #1 Paper of the day
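Authors: Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang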

Abstract

A vision-language model operates continuously in real time, deciding autonomously when to respond or delegate, which enables interactive systems that perceive and act on environmental changes without user prompting.

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard.

To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin.

To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
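The per-second choice described in the abstract (stay silent, respond, or delegate to a background model) can be pictured as a small control loop. The sketch below is only an illustration under stated assumptions: `Action`, `Decision`, `run_interaction`, `toy_policy`, and `toy_background` are hypothetical names invented here, frames are stood in for by text descriptions, and the keyword rules replace what the released model decides internally. It is not the project's actual API or system code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class Action(Enum):
    """The three per-second choices named in the abstract."""
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # reply text, or the query handed to the background model


# Hypothetical interfaces: a policy maps (second, frame) to a Decision,
# and a background model maps a query string to an answer string.
Policy = Callable[[int, str], Decision]
Background = Callable[[str], str]


def run_interaction(frames: Iterable[str], policy: Policy,
                    background: Background) -> List[str]:
    """Feed one frame per second to the policy and collect what gets spoken."""
    spoken: List[str] = []
    for second, frame in enumerate(frames):
        decision = policy(second, frame)
        if decision.action is Action.SILENT:
            continue  # nothing worth saying this second
        if decision.action is Action.RESPOND:
            spoken.append(f"[{second}s] {decision.text}")
        else:
            # DELEGATE: pass the hard query to the pluggable background model
            spoken.append(f"[{second}s] {background(decision.text)}")
    return spoken


def toy_policy(second: int, frame: str) -> Decision:
    """Keyword stand-in for the model's internal decision (assumption)."""
    if "fire" in frame:
        return Decision(Action.RESPOND, "Fire detected on the monitor.")
    if "equation" in frame:
        return Decision(Action.DELEGATE, "solve the equation on screen")
    return Decision(Action.SILENT)


def toy_background(query: str) -> str:
    """Stand-in for a background brain backed by any API or agent."""
    return f"(background) handled: {query}"


if __name__ == "__main__":
    stream = ["empty hallway", "fire in hallway", "whiteboard with equation"]
    for line in run_interaction(stream, toy_policy, toy_background):
        print(line)
```

Passing the policy and the background model in as plain callables mirrors the pluggable design the abstract describes: either one could be swapped without touching the loop.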

Community

Paper submitter about 9 hours ago

Figure 1: Overview of JoyAI-VL-Interaction.

Figure 3: Overview of the JoyAI-VL-Interaction System.

Get this paper in your agent:

hf papers read 2606.14777
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
