Hugging Face Daily Papers · June 23, 2026 · 4 min read

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

#model-release #agents #benchmark #funding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Most agent benchmarks use synthetic tasks. EnterpriseClawBench is instead distilled from a large archive of real, proprietary workplace sessions (agents reading heterogeneous files, calling tools, and shipping actual business artifacts) and turned into 852 reproducible tasks. We deliberately don't release the data; the reusable contribution is the construction and evaluation protocol, which you can run on your own private sessions. Even the best harness–model config (Codex + GPT-5.5) reaches only 0.663, and EnterpriseClawBench argues a single score hides what matters: harness–model pairing, artifact delivery, cost, runtime, and skill transfer.

arxiv:2606.23654

Published on Jun 22 · Submitted by Kaiyan Zhang on Jun 23 · Frontis AI

Authors:

Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian, Zhenzhao Yuan, Junlin Yang, Dianqiao Lei, Kaiyan Zhang

Abstract

EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness–model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench
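
To make the "no single score" argument concrete, here is a minimal sketch of what per-pairing reporting could look like. It is not taken from the EnterpriseClawBench repository: the `RunResult` fields, the `report_by_pairing` helper, and the choice of plain means are all assumptions made for illustration, and the abstract does not say how scores are actually computed or aggregated.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunResult:
    """One agent run on one task. Every field name here is hypothetical."""
    task_id: str
    harness: str
    model: str
    score: float      # assumed rubric score in [0, 1]
    delivered: bool   # assumed flag: did the run produce the requested artifact?
    cost_usd: float
    runtime_s: float


def report_by_pairing(results):
    """Aggregate runs per (harness, model) pair, keeping each axis separate."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.harness, r.model)].append(r)

    report = {}
    for pairing, runs in groups.items():
        report[pairing] = {
            "n_tasks": len(runs),
            "mean_score": mean(r.score for r in runs),
            "delivery_rate": mean(1.0 if r.delivered else 0.0 for r in runs),
            "mean_cost_usd": mean(r.cost_usd for r in runs),
            "mean_runtime_s": mean(r.runtime_s for r in runs),
        }
    return report
```

Keying the report on the (harness, model) pair rather than on the model alone reflects the abstract's point that the same model can behave differently under different harnesses; delivery, cost, and runtime stay as separate columns so that none of them is folded into the score.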