Hugging Face Daily Papers · June 9, 2026 · 4 min read

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

#model-release #multimodal #agents #benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

OmniGameArena is a real-time benchmark of 12 new Unreal Engine 5 games (7 Solo, 3 PvP, 2 Coop). They share one action interface, so commercial VLMs, open-weight VLMs, and specialized game policies are all tested the same way. On top of the cold-start leaderboard, we add the Improvement Dynamics Curve (IDC): the agent reflects on its own play over several rounds, and we track how much the score goes up and whether the learned skill still works on unseen game variants. The project page has the leaderboard, gameplay videos, and a demo you can play in the browser.

Project Page: https://mxlin043.github.io/OmniGameArena/
Code: https://github.com/mxlin043/OmniGameArena
Environment (HF): https://huggingface.co/datasets/mxlin043/OmniGameArena

arxiv:2606.09826

Published on Jun 8 · Submitted by Mingxian Lin on Jun 9

Hong Kong University

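Authors: Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi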

Abstract

OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
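The IDC protocol in the abstract can be pictured as a simple loop: play once with no skill (cold start), let a reflector revise a bounded skill prompt, replay, and finally test the learned skill on a held-out variant. The following is a minimal sketch of that idea, not the authors' harness. Every name in it (`run_idc`, `IDCResult`, `toy_play`, `toy_reflect`), the character budget on the skill prompt, the variant labels, and the toy scoring rule are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures; the real OmniGameArena harness may differ.
PlayFn = Callable[[str, str], float]     # (skill_prompt, variant) -> episode score
ReflectFn = Callable[[str, float], str]  # (skill_prompt, last_score) -> revised prompt


@dataclass
class IDCResult:
    round_scores: List[float] = field(default_factory=list)  # index 0 = cold start
    held_out_score: float = 0.0
    final_skill: str = ""


def run_idc(play: PlayFn, reflect: ReflectFn, rounds: int = 3,
            max_skill_chars: int = 200,
            train_variant: str = "base",
            held_out_variant: str = "held_out") -> IDCResult:
    """Run a cold-start episode, then `rounds` reflect-and-replay iterations."""
    skill = ""  # cold start: no learned skill yet
    result = IDCResult()
    result.round_scores.append(play(skill, train_variant))
    for _ in range(rounds):
        # The reflector proposes a revised skill; truncation keeps it bounded.
        skill = reflect(skill, result.round_scores[-1])[:max_skill_chars]
        result.round_scores.append(play(skill, train_variant))
    # Second observable: how the final skill behaves on a held-out variant.
    result.held_out_score = play(skill, held_out_variant)
    result.final_skill = skill
    return result


def toy_play(skill: str, variant: str) -> float:
    # Toy stand-in for a game episode: each "tip" is worth 10 points,
    # and the held-out variant only gets half the benefit.
    base = 10.0 * skill.count("tip")
    return base * (0.5 if variant == "held_out" else 1.0)


def toy_reflect(skill: str, last_score: float) -> str:
    # Toy stand-in for the reflector LLM: append one more tip each round.
    return skill + "tip;"


result = run_idc(toy_play, toy_reflect, rounds=3)
print(result.round_scores)    # score per round, starting from cold start
print(result.held_out_score)  # score of the final skill on the held-out variant
```

In this sketch the list `round_scores` plays the role of the improvement curve, and comparing its last entry with `held_out_score` gives a rough view of how well the learned skill transfers. A real harness would replace `toy_play` with an actual game episode and `toy_reflect` with a call to the reflector model.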