A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets
Abstract
A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology.
Predictive code completion greatly accelerates how quickly developers work. Spreadsheets are much more common, yet such auto-completion features are virtually non-existent in them. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges stand in the way: (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences totaling 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We evaluate multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze what the benchmark reveals about them, including properties of saved actions and false positives, efficiency, and the effects of user profiles, triggers, and context.
Community
Accepted at ICML 2026.
Spreadsheets are everywhere — budgets, schedules, reports, dashboards — yet building even a simple one takes hundreds of small clicks: typing values, picking colors, drawing borders, copying styles. Programmers get powerful autocomplete (like GitHub Copilot) that quietly suggests the next few lines as they type, but spreadsheet users get nothing similar. Why? Because there's no good way to test such a system: nobody has recorded what real users do, step by step, while building a spreadsheet, and the actions span colors, layouts, and formulas all at once — so it's hard to even define what "the right next suggestion" means.
We tackle both problems. First, we built a benchmark of 52 spreadsheets with the full step-by-step recipe a person would follow to make each one — 12,000 actions total, hand-checked. Second, we built a testing framework that mimics a real user: at every step the system makes a suggestion, the simulated user accepts or rejects it, and the remaining work is updated to reflect what was just done (or undone).
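To make the accept/reject loop concrete, here is a minimal sketch of how such a simulated user could work. It is an illustration only, not the released framework: the string encoding of actions, the `simulate` function, the exact-prefix acceptance rule, and the `repeat_last` toy predictor are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence

Action = str  # hypothetical encoding, e.g. "bold" or "type"
Predictor = Callable[[Sequence[Action]], List[Action]]


def simulate(target: Sequence[Action], predict: Predictor) -> Dict[str, float]:
    """Replay `target` with a simulated user and count accepted predictions."""
    history: List[Action] = []
    remaining: List[Action] = list(target)
    manual = saved = rejected = 0

    while remaining:
        # The simulated user performs the next action by hand.
        history.append(remaining.pop(0))
        manual += 1
        if not remaining:
            break

        # After each user action, the predictor proposes future actions.
        proposal = predict(history)
        if not proposal:
            continue

        # Assumed acceptance rule: accept only if the proposal exactly
        # matches the next actions the user still has to perform.
        if remaining[:len(proposal)] == proposal:
            history.extend(proposal)
            del remaining[:len(proposal)]  # remaining work shrinks on acceptance
            saved += len(proposal)
        else:
            rejected += 1

    fraction = saved / len(target) if target else 0.0
    return {"manual": manual, "saved": saved, "rejected": rejected,
            "saved_fraction": fraction}


def repeat_last(history: Sequence[Action]) -> List[Action]:
    """Toy predictor (an assumption): guess that the last action repeats."""
    return [history[-1]]


target = ["type", "bold", "bold", "bold", "type", "border"]
result = simulate(target, repeat_last)
print(result)
```

In this sketch every target action is either performed by hand or covered by an accepted suggestion, so `manual + saved` always equals the length of the target sequence. A fuller evaluation would need a richer notion of matching than exact equality, since the abstract notes that actions are spatial, temporal, and composite.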
Using this, we tested several frontier LLMs, small fine-tuned models, and simple statistical methods. Today's best AI can already save a user roughly one in three clicks, and even tiny models can match it after a little training. The benchmark and tools we release let researchers measure exactly where these assistants help, where they hurt, and what would make them genuinely useful in everyday spreadsheet work.
📄 arXiv: https://arxiv.org/abs/2606.13802 · 💻 Code & benchmark: https://github.com/Tej-55/NAPE