Hugging Face Daily Papers · 6 min read

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang
Organization: Chinese University of Hong Kong, Shenzhen (CUHKSZ)
Code: https://github.com/tongxuluo/gamecraft-bench
arxiv:2606.17861

Published on Jun 16 · Submitted by Tongxu Luo on Jun 17

Abstract

End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive verification.

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
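
As a rough illustration of how rubric-guided judging could be rolled up into a single benchmark percentage, here is a minimal sketch. It is not the authors' scoring code: the `RubricItem` structure, the weights, the pass/fail verdicts, the task names, and the unweighted averaging over tasks are all assumptions made for this example. The abstract only states that gameplay is assessed through replayed demonstrations and rubric-guided multimodal judging.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One hypothetical rubric criterion checked on a replayed demonstration."""
    name: str
    weight: float  # relative importance (assumed, not taken from the paper)
    passed: bool   # verdict that a multimodal judge might return


def task_score(items: list[RubricItem]) -> float:
    """Weighted fraction of rubric items passed for one task, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.passed)
    return earned / total


def benchmark_score(tasks: dict[str, list[RubricItem]]) -> float:
    """Unweighted mean of per-task scores, expressed as a percentage."""
    if not tasks:
        return 0.0
    return 100.0 * sum(task_score(items) for items in tasks.values()) / len(tasks)


# Hypothetical task names and verdicts, for illustration only.
example_tasks = {
    "platformer_01": [
        RubricItem("player can move and jump", 1.0, True),
        RubricItem("score display updates", 1.0, False),
    ],
    "puzzle_03": [
        RubricItem("win condition triggers", 2.0, True),
        RubricItem("visual feedback on match", 2.0, True),
    ],
}

print(benchmark_score(example_tasks))
```

Under this assumed scheme, a game that implements its core mechanic but misses content or visual-feedback criteria earns partial credit rather than zero, which matches the abstract's observation that agents often implement recognizable mechanics yet fall short of complete games.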

Community

Tongxu Luo (paper author, paper submitter) · about 23 hours ago

Can coding agents build an actual game in a real game engine?

We introduce GameCraft-Bench, a benchmark of 140 Godot tasks across 15 game families for evaluating end-to-end game generation through interactive gameplay verification.

The strongest frontier agent achieves only 41.46%, suggesting that creating complete, playable games remains far from solved.

Demos, code, and data: https://tongxuluo.github.io/gamecraft-bench-website/

Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/gamecraft-bench-can-agents-build-playable-games-end-to-end-in-a-real-game-engine-8274-d45a6828
Covers the executive summary, detailed methodology, and practical applications.

Neat paper. It is interesting to see a benchmark tackle end-to-end game generation within an engine like Godot, rather than just writing standalone scripts. The focus on engine grounding and interactive verification seems like a necessary step to see if these models can actually build something playable.

I am curious, since the agents often hit a wall with content and visual feedback, what do you think is the biggest bottleneck in the current feedback loop?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/03e2f80d-a440-4065-a474-82e4e64eed6a


Get this paper in your agent:

hf papers read 2606.17861
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
