Hugging Face Daily Papers · 4 min read

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2605.23345

Published on May 22, 2026 · Submitted by Zeqing Wang on May 25, 2026

Authors: Zizhao Tong, Hongfeng Lai, Zeqing Wang, Zhaohu Xing, Kexu Cheng, Haoran Xu, Zhao Pu, Shangwen Zhu, Ruili Feng, Jian Zhao, Yan Zhang, Hao Tang, Yeying Jin, Ling Shao

AI-generated summary

SCOPE enables precise action response in FPS games by conditioning transformer blocks in video diffusion models to separate in-scope from out-of-scope visual effects without segmentation labels.

Abstract

Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels.

We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.
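The abstract's idea of reshaping features into per-pixel temporal sequences can be illustrated with a toy sketch. This is not the SCOPE module: the function names, the sigmoid gate, and the weights `gate_weights` and `action_proj` are assumptions made for illustration. The point is only that each pixel's response to a frame's action is computed from that pixel's own features, so the same action can change one region strongly and leave another almost untouched.

```python
import math


def to_pixel_sequences(features):
    """Turn [T][H][W][C] features into H*W sequences of shape [T][C]."""
    frames, height, width = len(features), len(features[0]), len(features[0][0])
    return [
        [features[t][y][x] for t in range(frames)]
        for y in range(height)
        for x in range(width)
    ]


def from_pixel_sequences(sequences, height, width):
    """Inverse of to_pixel_sequences: back to [T][H][W][C]."""
    frames = len(sequences[0])
    return [
        [[sequences[y * width + x][t] for x in range(width)] for y in range(height)]
        for t in range(frames)
    ]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def apply_action_conditioning(features, actions, gate_weights, action_proj):
    """Toy stand-in for a conditioning step (hypothetical, not the paper's layer).

    features:     [T][H][W][C] visual features
    actions:      [T][A] one action vector per frame
    gate_weights: [C] maps a pixel's local feature to a scalar gate
    action_proj:  [C][A] maps an action vector to a per-channel update
    """
    height, width = len(features[0]), len(features[0][0])
    conditioned = []
    for sequence in to_pixel_sequences(features):
        new_sequence = []
        for feature, action in zip(sequence, actions):
            # The gate depends only on this pixel's own feature vector.
            gate = 1.0 / (1.0 + math.exp(-dot(feature, gate_weights)))
            delta = [dot(action, row) for row in action_proj]
            new_sequence.append([f + gate * d for f, d in zip(feature, delta)])
        conditioned.append(new_sequence)
    return from_pixel_sequences(conditioned, height, width)


# Two frames, a 1x2 image, two channels; one action channel active on frame 1.
features = [[[[10.0, 0.0], [-10.0, 0.0]]], [[[10.0, 0.0], [-10.0, 0.0]]]]
actions = [[0.0], [1.0]]
out = apply_action_conditioning(features, actions, [1.0, 0.0], [[0.0], [1.0]])
print(out[1][0][0], out[1][0][1])
```

In this toy setup the left pixel plays the role of an in-scope position (gate close to 1) and the right pixel an out-of-scope one (gate close to 0). No mask is supplied; the split comes from the features themselves, which mirrors the abstract's claim about working without segmentation labels.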

Community

Paper submitter about 8 hours ago

🚀 Introducing SCOPE — an interactive world model for FPS games.

SCOPE handles dense FPS controls by learning per-pixel temporal action responses, separating localized weapon/scope effects from stable scene generation without segmentation labels.

We also release CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry: 69K clips, 7 titles, 10-DoF controls.

👇 Dive in:
📄 Arxiv: http://arxiv.org/abs/2605.23345
🏠 Project Page: https://z2tong.github.io/SCOPE/
💻 Code: https://github.com/z2tong/SCOPE
🤗 Model: https://huggingface.co/zizhaotong/SCOPE
🗄️ Dataset: https://huggingface.co/collections/zizhaotong/crossfps
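
The post mentions frame-aligned action telemetry with 10-DoF controls. How CrossFPS performs that alignment is not described on this page, so the following is only a generic sketch of one common approach: for each video frame, take the most recent controller sample at or before the frame's timestamp. The function name `align_actions_to_frames`, the neutral all-zero default, and the sample layout are assumptions; the meaning of the 10 channels is not given here and is left unnamed.

```python
from bisect import bisect_right

ACTION_DIM = 10  # 10-DoF controls per the post; channel meanings are not specified


def align_actions_to_frames(sample_times, samples, num_frames, fps):
    """Return one action vector per frame (hypothetical helper).

    sample_times: sorted timestamps in seconds, one per telemetry sample
    samples:      action vectors of length ACTION_DIM, same order as sample_times
    """
    aligned = []
    for frame_index in range(num_frames):
        frame_time = frame_index / fps
        # Index of the last sample whose timestamp is <= frame_time.
        position = bisect_right(sample_times, frame_time) - 1
        if position < 0:
            # No input recorded yet: assume a neutral action.
            aligned.append([0.0] * ACTION_DIM)
        else:
            aligned.append(list(samples[position]))
    return aligned


# Three telemetry samples, three frames at 10 fps (frame times 0.0, 0.1, 0.2 s).
times = [0.0, 0.05, 0.12]
values = [[float(i)] + [0.0] * (ACTION_DIM - 1) for i in (1, 2, 3)]
per_frame = align_actions_to_frames(times, values, num_frames=3, fps=10)
print([vector[0] for vector in per_frame])
```

Holding the last sample is a simple choice that never looks into the future; interpolation or averaging over the frame interval would be reasonable alternatives for continuous channels such as camera motion.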


Get this paper in your agent:

hf papers read 2605.23345
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
