NSF-SciFy: Mining the NSF Awards Database for Scientific Claims
Abstract
AI-generated summary: NSF-SciFy is a large-scale dataset of scientific claims and investigation proposals extracted from NSF award abstracts, enabling improved language model fine-tuning for claim verification and scientific discovery tracking.
We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset's utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis. Code and data are available at https://github.com/darpa-scify/NSFSciFy.
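The abstract reports results as relative gains and describes extraction quality in terms of precision and recall. As a minimal sketch of how those quantities are commonly computed, the snippet below uses hypothetical scores and hypothetical claim strings, and it assumes exact string matching between extracted and reference claims. None of the names or numbers come from the paper, and the authors' actual matching procedure may differ.

```python
def relative_gain(baseline: float, fine_tuned: float) -> float:
    """Relative improvement of fine_tuned over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (fine_tuned - baseline) / baseline


def precision_recall(extracted: set[str], reference: set[str]) -> tuple[float, float]:
    """Precision and recall of extracted claims against reference claims.

    Assumption: a claim counts as correct only on an exact string match.
    """
    matched = len(extracted & reference)
    precision = matched / len(extracted) if extracted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical scores: a metric that doubles is a 100% relative gain.
gain = relative_gain(baseline=0.25, fine_tuned=0.5)

# Hypothetical claims: every extracted claim is correct (high precision),
# but half of the reference claims are missed (lower recall).
extracted = {"claim a", "claim b"}
reference = {"claim a", "claim b", "claim c", "claim d"}
precision, recall = precision_recall(extracted, reference)

print(f"relative gain: {gain:.1f}%")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In this toy case the extractor returns only correct claims but finds half of the reference set, which is the high-precision, lower-recall pattern the error analysis describes.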
Community
To be presented at ACL 2026 main (oral).