Hugging Face Daily Papers · 7 min read

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.26029

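Published on May 28, 2026 · Submitted by Dylan on May 29, 2026
Authors: Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng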

Abstract

AI-generated summary: CausaLab evaluates LLM agents on causal discovery by requiring both accurate predictions and faithful recovery of underlying causal mechanisms through synthetic experimental scenarios.

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge.

Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge F1. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
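To make the observation-versus-intervention distinction in the abstract concrete, here is a minimal Python sketch of a toy linear SCM. Everything in it is an assumption for illustration: the three-node graph (Z -> X, Z -> Y, X -> Y), the coefficients, the noise levels, and the helper names `sample` and `slope` are hypothetical and are not taken from CausaLab or its environment API.

```python
import random

# Hypothetical 3-node linear SCM: Z -> X, Z -> Y, X -> Y (Z confounds X and Y).
# Coefficients and noise scales are invented for this sketch, not from the paper.
def sample(rng, do_x=None):
    z = rng.gauss(0.0, 1.0)
    # do(X = v) replaces X's structural equation, which cuts the Z -> X edge.
    x = do_x if do_x is not None else 2.0 * z + rng.gauss(0.0, 0.1)
    y = 1.0 * x + 3.0 * z + rng.gauss(0.0, 0.1)
    return z, x, y

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

rng = random.Random(0)

# Observational regime: the slope mixes the direct X -> Y effect with the
# backdoor path through Z, so it overstates the causal coefficient of 1.0.
obs = [sample(rng) for _ in range(5000)]
obs_slope = slope([x for _, x, _ in obs], [y for _, _, y in obs])

# Interventional regime: X is set by hand, independently of Z, so the slope
# estimates the structural coefficient on the X -> Y edge.
levels = [rng.choice([-1.0, 0.0, 1.0]) for _ in range(5000)]
intv = [sample(rng, do_x=v) for v in levels]
int_slope = slope([x for _, x, _ in intv], [y for _, _, y in intv])

print(f"observational slope: {obs_slope:.2f}, interventional slope: {int_slope:.2f}")
```

In this toy setup the observational slope sits near 2.5 in expectation while the interventional slope sits near the true coefficient of 1.0. A predictor fitted on observational records alone can still forecast Y well under the same regime, which is one way prediction accuracy and mechanism recovery can come apart.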

Community

Paper submitter 1 day ago
edited 1 day ago

CausaLab turns “AI scientist” evaluation from a static quiz into a live laboratory: an LLM agent must study noisy measurement records, choose interventions on a controllable crystal, infer the hidden structural causal model, and transfer that mechanism to predict a held-out crystal’s frequency. Its key punchline is brutal: today’s strongest agents can often get the right answer without truly discovering the right cause—GPT-5.2-high reaches 92% prediction accuracy in one observational setting but only 0.471 all-edge causal F1—showing that predictive success and causal understanding are sharply separable. By scoring both final answers and recovered mechanisms over interactive trajectories, CausaLab exposes a central bottleneck for AI scientists: models still stop too early, commit to weak hypotheses, and struggle to revise causal theories from intervention evidence.
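The structural score mentioned above can be illustrated with a small sketch. This page does not spell out how the paper defines "all-edge F1", so the code below assumes the standard directed-edge F1 between a true and a predicted edge set; the function name `edge_f1` and the example graphs are hypothetical.

```python
def edge_f1(true_edges, pred_edges):
    """F1 over directed edges; a reversed edge counts as both a miss and a false positive."""
    true_set, pred_set = set(true_edges), set(pred_edges)
    tp = len(true_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth: Z -> X, Z -> Y, X -> Y.
true_graph = [("Z", "X"), ("Z", "Y"), ("X", "Y")]

# Hypothetical agent output: one edge reversed (X -> Z) and one edge missing (Z -> Y).
pred_graph = [("X", "Z"), ("X", "Y")]

score = edge_f1(true_graph, pred_graph)
print(f"edge F1: {score:.3f}")  # precision 1/2, recall 1/3, so F1 = 0.4
```

Under this assumed definition, an agent that gets only the X -> Y edge right scores 0.4 on structure, even though that single edge may be enough to answer many prediction queries correctly.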

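This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions (2026), https://huggingface.co/papers/2605.08197
* NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise (2026), https://huggingface.co/papers/2605.04313
* Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents (2026), https://huggingface.co/papers/2605.23574
* CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures (2026), https://huggingface.co/papers/2605.25338
* Why LLMs Fail at Causal Discovery and How Interventional Agents Escape (2026), https://huggingface.co/papers/2605.27567
* Separable Pathways for Causal Reasoning: How Architectural Scaffolding Enables Hypothesis-Space Restructuring in LLM Agents (2026), https://huggingface.co/papers/2604.20039
* Towards a Universal Causal Reasoner (2026), https://huggingface.co/papers/2605.24873

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend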

Get this paper in your agent:

hf papers read 2605.26029
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
