Hugging Face Daily Papers · June 16, 2026 · 5 min read

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Figure 1: Overview of JoyAI-VL-Interaction.

Figure 3: Overview of the JoyAI-VL-Interaction System.

Project page: https://joyai-vl-video-future-academy-jd.github.io/JoyAI-VL-Interaction/

Code: https://github.com/jd-opensource/JoyAI-VL-Interaction


arxiv:2606.14777


Published on Jun 10 · Submitted by steven young on Jun 16

Authors:

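Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang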

Abstract

A vision-language model operates continuously in real time, deciding autonomously when to respond or delegate, which enables interactive systems that perceive and act on environmental changes without user prompting.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
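The per-second decision described in the abstract (stay silent, respond, or delegate to a background model) can be pictured as a simple control loop. The following is only a minimal sketch under stated assumptions, not the released system's code: the names `Action`, `Decision`, `run_interaction_loop`, `toy_decide`, and `toy_background` are hypothetical, frames are stood in for by short text descriptions, the decision function is a hand-written rule rather than the 8B model, and the background call is made synchronously for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class Action(Enum):
    """The three per-tick choices named in the abstract."""
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # reply text, or the query passed to the background model


def run_interaction_loop(
    frames: Iterable[str],
    decide: Callable[[str, List[str]], Decision],
    background: Callable[[str], str],
) -> List[str]:
    """Feed one observation per tick to `decide` and collect what is spoken."""
    history: List[str] = []
    spoken: List[str] = []
    for second, frame in enumerate(frames):
        decision = decide(frame, history)
        history.append(frame)
        if decision.action is Action.SILENT:
            continue  # nothing worth saying this second
        if decision.action is Action.DELEGATE:
            # Hard case: hand the query to the pluggable background model.
            reply = background(decision.text)
        else:
            reply = decision.text
        spoken.append(f"[t={second}s] {reply}")
    return spoken


def toy_decide(frame: str, history: List[str]) -> Decision:
    """Hypothetical stand-in for the model's internal response decision."""
    if "fire" in frame:
        return Decision(Action.RESPOND, "Warning: possible fire on camera.")
    if "math" in frame:
        return Decision(Action.DELEGATE, frame)
    return Decision(Action.SILENT)


def toy_background(query: str) -> str:
    """Hypothetical stand-in for a background API or agent."""
    return f"(background answer for: {query})"


demo = run_interaction_loop(
    ["empty hallway", "fire in hallway", "math on whiteboard"],
    toy_decide,
    toy_background,
)
for line in demo:
    print(line)
```

In this toy run the loop stays silent on the first frame, responds directly on the second, and delegates on the third. Passing `decide` and `background` as plain callables mirrors the pluggable design the abstract describes, where everything around the interaction model can be swapped.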