Hugging Face Daily Papers · 4 min read

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project Page: https://mxlin043.github.io/OmniGameArena/
Code: https://github.com/mxlin043/OmniGameArena
Environment (HF): https://huggingface.co/datasets/mxlin043/OmniGameArena
arxiv:2606.09826

Published on Jun 8 · Submitted by Mingxian Lin on Jun 9
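Authors: Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi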

Abstract

OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization.

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
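
The reflection loop described in the abstract can be pictured with a small sketch. This is not the authors' harness: the function names (`run_idc`, `toy_play`, `toy_reflect`), the 200-character bound, and the truncation rule are all assumptions made for illustration, and the toy scorer stands in for a real game. It only shows the shape of the two observables: a score per reflection round, and a score for the final skill on held-out variants.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed bound on the skill prompt; the real bound is not stated in the abstract.
MAX_SKILL_CHARS = 200


@dataclass
class IDCResult:
    round_scores: List[float] = field(default_factory=list)  # index 0 is the cold-start score
    held_out_score: float = 0.0  # mean score of the final skill on unseen variants
    final_skill: str = ""


def run_idc(
    play: Callable[[str, str], float],
    reflect: Callable[[str, float], str],
    train_variant: str,
    held_out_variants: List[str],
    rounds: int,
) -> IDCResult:
    """Hypothetical IDC-style loop: play, reflect, rewrite a bounded skill prompt."""
    result = IDCResult()
    skill = ""  # cold start: no learned skill yet
    result.round_scores.append(play(train_variant, skill))
    for _ in range(rounds):
        # The reflector sees the current skill and the latest score, then proposes a new skill.
        proposal = reflect(skill, result.round_scores[-1])
        skill = proposal[:MAX_SKILL_CHARS]  # enforce the bound by truncation (an assumption)
        result.round_scores.append(play(train_variant, skill))
    result.final_skill = skill
    held = [play(variant, skill) for variant in held_out_variants]
    result.held_out_score = sum(held) / len(held) if held else 0.0
    return result


def toy_play(variant: str, skill: str) -> float:
    """Toy scorer: 10 points per hint in the skill, halved on non-training variants."""
    hints = [h for h in skill.split(";") if h]
    base = 10.0 * len(hints)
    return base if variant == "train" else base / 2


def toy_reflect(skill: str, last_score: float) -> str:
    """Toy reflector: appends one hint per round."""
    return skill + f"hint{int(last_score)};"


if __name__ == "__main__":
    result = run_idc(toy_play, toy_reflect, "train", ["variant_a", "variant_b"], rounds=3)
    print(result.round_scores, result.held_out_score)
```

Plotting `round_scores` against the round index gives a curve in the spirit of the IDC, and comparing its last value with `held_out_score` gives a rough sense of how much of the improvement transfers.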

Community

Paper submitter about 3 hours ago

OmniGameArena is a real-time benchmark of 12 new Unreal Engine 5 games (7 Solo, 3 PvP, 2 Coop). They share one action interface, so commercial VLMs, open-weight VLMs, and specialized game policies are all tested the same way. On top of the cold-start leaderboard, we add the Improvement Dynamics Curve (IDC): the agent reflects on its own play over several rounds, and we track how much the score goes up and whether the learned skill still works on unseen game variants. The project page has the leaderboard, gameplay videos, and a demo you can play in the browser.
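
As a rough illustration of what "one action interface" can mean in practice, the sketch below puts a scripted policy and a prompted model behind the same `act` method so that a single evaluation function can score both. Everything here is hypothetical: the `Action` fields, the class names, and the toy scoring rule are assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass(frozen=True)
class Action:
    """Hypothetical unified action: a movement axis pair plus named buttons."""
    move_x: float = 0.0
    move_y: float = 0.0
    buttons: Tuple[str, ...] = ()


class Agent(Protocol):
    """Any agent class only needs to map a frame to an Action."""
    def act(self, frame: bytes) -> Action: ...


def parse_reply(text: str) -> Action:
    """Turn a free-text model reply into the shared Action type."""
    tokens = text.lower().split()
    move_x = 1.0 if "right" in tokens else -1.0 if "left" in tokens else 0.0
    buttons = ("jump",) if "jump" in tokens else ()
    return Action(move_x=move_x, buttons=buttons)


class ScriptedPolicy:
    """Stand-in for a specialized game policy: always moves right."""
    def act(self, frame: bytes) -> Action:
        return Action(move_x=1.0)


class PromptedVLM:
    """Stand-in for a VLM agent; a real one would send the frame to a model."""
    def __init__(self, canned_reply: str) -> None:
        self.canned_reply = canned_reply

    def act(self, frame: bytes) -> Action:
        return parse_reply(self.canned_reply)


def evaluate(agent: Agent, frames: List[bytes]) -> float:
    """Toy scoring: one point per frame on which the agent moves right."""
    return float(sum(1 for frame in frames if agent.act(frame).move_x > 0))


if __name__ == "__main__":
    frames = [b""] * 3
    print(evaluate(ScriptedPolicy(), frames), evaluate(PromptedVLM("right jump"), frames))
```

Because `evaluate` depends only on the `Agent` protocol, adding a new agent class does not require changing the scoring code.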

Get this paper in your agent:

hf papers read 2606.09826
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
