EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions
Authors: Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian, Zhenzhao Yuan, Junlin Yang, Dianqiao Lei, Kaiyan Zhang
Organization: Frontis AI
Published: 2026-06-22
arXiv: 2606.23654
Project page: https://frontisai.github.io/EnterpriseClawBench/
Abstract
EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores.
Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness–model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench
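To make the abstract's two structural claims concrete, here is a minimal Python sketch of what a task record and a multi-axis report might look like. Everything in it is an assumption for illustration: the class names `Task` and `RunResult`, every field name, the `summarize` helper, and the placeholder harness and model labels are hypothetical and are not taken from the EnterpriseClawBench repository. The numbers in `demo_results` are made up and are not results from the paper.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Task:
    """One benchmark task. Field names are illustrative, not the authors' schema."""
    task_id: str
    prompt: str                 # rewritten prompt
    fixtures: list[str]         # paths to recovered input files
    role_class: str
    skill_subclass: str
    hard_rules: list[str] = field(default_factory=list)
    rubric: list[str] = field(default_factory=list)  # semantic rubric items


@dataclass
class RunResult:
    """Outcome of one harness-model configuration on one task (hypothetical)."""
    task_id: str
    harness: str
    model: str
    score: float      # assumed to lie in [0, 1]
    delivered: bool   # whether the requested artifact was produced
    cost_usd: float
    runtime_s: float


def summarize(results: list[RunResult]) -> dict[tuple[str, str], dict[str, float]]:
    """Report several axes per (harness, model) pair instead of one number."""
    groups: dict[tuple[str, str], list[RunResult]] = defaultdict(list)
    for r in results:
        groups[(r.harness, r.model)].append(r)
    return {
        pair: {
            "mean_score": mean(r.score for r in runs),
            "delivery_rate": mean(1.0 if r.delivered else 0.0 for r in runs),
            "mean_cost_usd": mean(r.cost_usd for r in runs),
            "mean_runtime_s": mean(r.runtime_s for r in runs),
            "n_tasks": len(runs),
        }
        for pair, runs in groups.items()
    }


# Made-up numbers, only to exercise the code.
demo_results = [
    RunResult("t1", "harness_a", "model_x", 0.5, True, 1.0, 10.0),
    RunResult("t2", "harness_a", "model_x", 1.0, False, 3.0, 30.0),
    RunResult("t1", "harness_b", "model_x", 0.25, True, 2.0, 20.0),
]

if __name__ == "__main__":
    for pair, row in summarize(demo_results).items():
        print(pair, row)
```

Keying the report on the (harness, model) pair rather than on the model alone reflects the abstract's point that the same model can behave differently under different harnesses. Visual quality and skill transfer are left out of this sketch because the abstract does not say how they are measured.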
Community
Most agent benchmarks use synthetic tasks. EnterpriseClawBench is instead distilled from a large archive of real, proprietary workplace sessions (agents reading heterogeneous files, calling tools, and shipping actual business artifacts) and turned into 852 reproducible tasks. We deliberately don't release the data; the reusable contribution is the construction and evaluation protocol, which you can run on your own private sessions. Even the best harness–model configuration (Codex + GPT-5.5) reaches only 0.663, and the paper argues that a single score hides what matters: harness–model pairing, artifact delivery, cost, runtime, and skill transfer.
Cite arxiv.org/abs/2606.23654 in a model, dataset, or Space README.md to link it from this page.