Hugging Face Daily Papers · June 23, 2026 · 4 min read

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.22883

Published on Jun 22, 2026 · Submitted by Jiaheng Liu on Jun 23, 2026 · NJU-LINK Lab

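Authors: Zhanbo Hua, Yifan Yao, Weihao Xie, Yongchi Zhao, Minghao Liu, Ruizhi Qiu, Zhewei Huang, Zun Wang, Yiyan Ji, Yunhai Ye, Letian Zhu, Xinping Lei, Han Li, Zhiyuan Ma, Zili Wang, Zhaoxiang Zhang, Jiaheng Liu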

Abstract

A principled synthesis engine generates high-quality terminal-agent tasks through a multi-dimensional capability taxonomy and evidence-guided research, creating a distilled dataset that enables significant performance gains in LLM training.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.
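Two steps named in the abstract, sampling candidates across the four taxonomy axes and fail-to-pass checking, can be illustrated with a minimal sketch. This is not the authors' implementation: the taxonomy values, the `Candidate` class, and both function names are hypothetical, and the reading of "strict" fail-to-pass (every test must fail in the initial environment and pass after the reference solution is applied) is an assumption, since the abstract does not define it.

```python
import itertools
import random
from dataclasses import dataclass

# Hypothetical axis values; the abstract names only the four axes themselves.
TAXONOMY = {
    "domain": ["networking", "data-processing", "build-systems"],
    "skill_type": ["debugging", "configuration", "scripting"],
    "capability": ["log-analysis", "dependency-resolution"],
    "engineering_pillar": ["reliability", "security"],
}
AXES = ["domain", "skill_type", "capability", "engineering_pillar"]


@dataclass(frozen=True)
class Candidate:
    """One sampled combination, with one value per taxonomy axis."""
    domain: str
    skill_type: str
    capability: str
    engineering_pillar: str


def sample_candidates(taxonomy, k, seed=0):
    """Draw k distinct combinations from the cross product of the axes."""
    combos = list(itertools.product(*(taxonomy[axis] for axis in AXES)))
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [Candidate(*combo) for combo in rng.sample(combos, k)]


def passes_fail_to_pass(before, after):
    """Assumed strict check over two test runs.

    `before` and `after` map test names to a pass flag, recorded in the
    initial environment and after the reference solution, respectively.
    A task is kept only if both runs cover the same non-empty set of
    tests, every test fails before, and every test passes after.
    """
    if not before or before.keys() != after.keys():
        return False
    all_fail_before = all(not passed for passed in before.values())
    all_pass_after = all(after.values())
    return all_fail_before and all_pass_after
```

In a real pipeline the `before` and `after` results would come from running the test suite inside the task's Dockerized environment; here they are plain dictionaries so the check can be read and exercised in isolation.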