Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence
Authors: Xuanle Zhao, Qiushi Sun, Jingyu Xiao, Xuexin Liu, Haoyue Yang, Qiaosheng Chen, Xianzhen Luo, Jing Huang, Yufeng Zhong, Lei Chen, Shuai Fu, Zhenlin Wei, Jinhe Bi, Lei Jiang, Haibo Qiu, Siqi Yang, Peng Shi, Jian Hu, Zhixiong Zeng
Published: 2026-06-16 (arXiv:2606.15932)
Abstract
AI-generated summary: This survey explores multimodal code intelligence systems that generate and reason with code based on visual inputs, categorizing approaches across GUI, scientific visualization, structured graphics, and emerging frameworks while identifying verification-centered research directions.
While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move this field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub at https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code.
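As an illustration of what multi-signal validation might look like in practice, the sketch below combines several correctness signals for one generated program into a single verdict. It is not taken from the survey: the `Signal` type, the `validate` function, the signal names, the weights, and the 0.7 threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """One piece of correctness evidence (hypothetical structure)."""
    name: str               # e.g. "executes", "visual_similarity"
    score: float            # assumed to be normalized to [0, 1]
    weight: float = 1.0     # relative importance in the combined score
    required: bool = False  # a required signal below threshold fails the sample


def validate(signals, threshold=0.7):
    """Combine complementary signals into a weighted score and a pass/fail verdict."""
    total_weight = sum(s.weight for s in signals)
    if not signals or total_weight <= 0:
        raise ValueError("need at least one signal with positive total weight")
    # Hard gate: a required signal under the threshold cannot be outvoted.
    failed_required = [s.name for s in signals if s.required and s.score < threshold]
    combined = sum(s.score * s.weight for s in signals) / total_weight
    passed = not failed_required and combined >= threshold
    return {"score": combined, "passed": passed, "failed_required": failed_required}


# Illustrative values only; real scores would come from running and rendering the code.
signals = [
    Signal("executes", 1.0, required=True),
    Signal("visual_similarity", 0.8, weight=2.0),
    Signal("structure_match", 0.6),
]
result = validate(signals)
print(result)
```

The hard gate on required signals reflects one possible design choice: a program that fails to execute should not pass on visual similarity alone, however high that score is.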