Hugging Face Daily Papers · 3 min read

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://openforecaster.github.io/futuresim
Code: https://github.com/OpenForecaster/futuresim
Organization: Max Planck Institute for Intelligent Systems
arxiv:2605.15188


Published on May 14
· Submitted by Shashwat Goel on May 15

Authors: Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping
Abstract

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.

AI-generated summary

FutureSim enables evaluation of AI agents' long-term predictive capabilities by simulating chronological real-world event sequences, revealing significant gaps in current forecasting performance.
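The abstract notes that many agents score a worse Brier skill score than making no prediction at all. As a refresher on what that means (this is a minimal sketch, not the paper's evaluation code, and the choice of a maximally uncertain 0.5 forecast to stand in for "no prediction" is an assumption):

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and
    binary outcomes (0 = did not happen, 1 = happened).
    Lower is better; 0.0 is a perfect forecast."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, baseline=0.5):
    """Skill relative to a reference forecast. Here the reference is
    an uninformative constant 0.5 (an assumed stand-in for abstaining).
    1.0 = perfect, 0.0 = no better than the baseline, negative =
    worse than making no prediction at all."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([baseline] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref
```

An overconfident agent illustrates the negative regime: predicting 0.9 for events that do not occur gives a Brier score of 0.81 against the baseline's 0.25, hence a skill score of about -2.24.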

Community

Paper submitter about 23 hours ago

[Figure: futuresim-env2-fig2 (https://cdn-uploads.huggingface.co/production/uploads/6506832221ac448013f94995/q7PlkDmsDQIOG7qf-bCm-.png)]


Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 0

No Collection including this paper


Discussion (0)

No comments yet.
