project page: <a href=\"https://minstar.github.io/OpenBioRQ/\" rel=\"nofollow\">https://minstar.github.io/OpenBioRQ/</a><br>dataset: <a href=\"https://huggingface.co/datasets/Minbyul/OpenBioRQ\">https://huggingface.co/datasets/Minbyul/OpenBioRQ</a></p>\n","updatedAt":"2026-06-26T03:01:18.301Z","author":{"_id":"64587be872b60ae7a3817858","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png","fullname":"Minbyul Jeong","name":"Minbyul","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":5,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5990283489227295},"editors":["Minbyul"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2606.21959","authors":[{"_id":"6a3deb013b43e283349ec1ab","name":"Minbyul Jeong","hidden":false}],"publishedAt":"2026-06-20T00:00:00.000Z","submittedOnDailyAt":"2026-06-26T00:00:00.000Z","title":"OpenBioRQ: Unsolved Biomedical Research Questions for Agents","submittedOnDailyBy":{"_id":"64587be872b60ae7a3817858","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png","isPro":false,"fullname":"Minbyul Jeong","user":"Minbyul","type":"user","name":"Minbyul"},"summary":"A working citation looks like proof -- but the fact that a link resolves does not mean the cited paper supports the claim. I find that current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% link to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. I introduce \\openbiorq{}, a retrieval-grounded agentic benchmark of 12{,}553 unsolved biomedical research questions across 12 domains that treats open questions as a faithfulness-and-abstention probe. To my knowledge, this is the first biomedical benchmark to combine an agentic setting -- where the model must issue multiple tool calls -- with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is empirical: I anchor it on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only ~17%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span a wide 29-60% range. The benchmark is thus hard, non-saturating (the best agent still leaves ~33-40\\% unsolved), and discriminating across capability tiers. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access entirely barely changes its score -- so tools stop paying off exactly where they are needed most. A frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82.","upvotes":3,"discussionId":"6a3deb013b43e283349ec1ac","projectPage":"https://minstar.github.io/OpenBioRQ/","githubRepo":"https://github.com/minstar/healthcare-research","githubRepoAddedBy":"user","ai_summary":"A new biomedical benchmark evaluates agentic models' ability to verify sources and avoid false citations by testing unsolved research questions with no answer keys, revealing significant failures in retrieval-grounded reasoning and tool usage.","ai_keywords":["agentic models","retrieval-grounded benchmark","biomedical research questions","citation verification","tool usage","agentic collapse","answer key","source verification","open-weight models","frontier agents"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","githubStars":1},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64587be872b60ae7a3817858","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png","isPro":false,"fullname":"Minbyul Jeong","user":"Minbyul","type":"user"},{"_id":"6a2da6c8ca070ee12c6e396c","avatarUrl":"/avatars/0355287dcabaa67dbc7f0b10b87451f9.svg","isPro":false,"fullname":"Joe Mama","user":"JoeMama123123123","type":"user"},{"_id":"696da0962b3e2d9587d0b35d","avatarUrl":"/avatars/4f6c177ad51fb687ca1be75d18f6f5d6.svg","isPro":false,"fullname":"mini","user":"mini0999","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2606/2606.21959.md","query":{}}">
OpenBioRQ: Unsolved Biomedical Research Questions for Agents
Abstract
A new biomedical benchmark evaluates agentic models' ability to verify sources and avoid false citations by testing unsolved research questions with no answer keys, revealing significant failures in retrieval-grounded reasoning and tool usage.
A working citation looks like proof -- but the fact that a link resolves does not mean the cited paper supports the claim. I find that current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% link to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. I introduce \openbiorq{}, a retrieval-grounded agentic benchmark of 12{,}553 unsolved biomedical research questions across 12 domains that treats open questions as a faithfulness-and-abstention probe. To my knowledge, this is the first biomedical benchmark to combine an agentic setting -- where the model must issue multiple tool calls -- with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is empirical: I anchor it on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only ~17%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span a wide 29-60% range. The benchmark is thus hard, non-saturating (the best agent still leaves ~33-40\% unsolved), and discriminating across capability tiers. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access entirely barely changes its score -- so tools stop paying off exactly where they are needed most. A frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82.
Community
Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.
Tap or paste here to upload images
Cite arxiv.org/abs/2606.21959 in a model README.md to link it from this page.
Cite arxiv.org/abs/2606.21959 in a dataset README.md to link it from this page.
Cite arxiv.org/abs/2606.21959 in a Space README.md to link it from this page.
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.