Hugging Face Daily Papers · 3 min read

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://openforecaster.github.io/futuresim
Code: https://github.com/OpenForecaster/futuresim
Organization: Max Planck Institute for Intelligent Systems
arxiv:2605.15188


Published on May 14
· Submitted by Shashwat Goel on May 15

Authors: Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping
Abstract

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.

AI-generated summary

FutureSim enables evaluation of AI agents' long-term predictive capabilities by simulating chronological real-world event sequences, revealing significant gaps in current forecasting performance.
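The abstract notes that many agents score a worse Brier skill score than making no prediction at all. As a refresher on what that means (this is a minimal sketch, not the paper's evaluation code, and the choice of a maximally uncertain 0.5 forecast to stand in for "no prediction" is an assumption):

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and
    binary outcomes (0 = did not happen, 1 = happened).
    Lower is better; 0.0 is a perfect forecast."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, baseline=0.5):
    """Skill relative to a reference forecast. Here the reference is
    an uninformative constant 0.5 (an assumed stand-in for abstaining).
    1.0 = perfect, 0.0 = no better than the baseline, negative =
    worse than making no prediction at all."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([baseline] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref
```

An overconfident agent illustrates the negative regime: predicting 0.9 for events that do not occur gives a Brier score of 0.81 against the baseline's 0.25, hence a skill score of about -2.24.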

Community

Paper submitter about 23 hours ago

[Figure: futuresim-env2-fig2 (https://cdn-uploads.huggingface.co/production/uploads/6506832221ac448013f94995/q7PlkDmsDQIOG7qf-bCm-.png)]


Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 0

No Collection including this paper


Discussion (0)

No comments yet.
