Hugging Face Daily Papers · May 22, 2026 · 3 min read

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

TerminalWorld is a scalable data engine that reverse-engineers real-world terminal recordings into a benchmark of 1,530 validated tasks to evaluate agent performance on authentic software engineering terminal workflows.

Authors: Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye

arxiv:2605.22535

Published on May 21 · Submitted by taesiri on May 22

Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
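The abstract quotes two statistics: a pass rate and a Pearson correlation between benchmark scores. As a rough illustration of how such numbers are computed in general, here is a minimal Python sketch. The function names `pass_rate` and `pearson_r` and every sample value are hypothetical and are not taken from the paper or its repository. The paper may aggregate results differently (for example, by averaging over repeated runs).

```python
from math import sqrt


def pass_rate(passed: int, total: int) -> float:
    """Percentage of tasks passed. Hypothetical helper, not the paper's code."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Illustrative only: assumes each of 200 tasks is scored exactly once.
    print(pass_rate(125, 200))

    # Made-up per-system scores on two benchmarks, NOT results from the paper.
    benchmark_a = [31.0, 45.5, 52.0, 62.5]
    benchmark_b = [40.0, 38.0, 55.0, 47.0]
    print(round(pearson_r(benchmark_a, benchmark_b), 2))
```

A value of r near 0 means that ranking systems by one benchmark says little about their ranking on the other, which is the sense in which the abstract calls the correlation with existing benchmarks weak.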

Get this paper in your agent:

hf papers read 2605.22535

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
