CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Abstract
CoffeeBench evaluates LLM agents in a multi-agent economic simulation where firms interact over 90 days to maximize profits, revealing differences in communication patterns and performance among various models.
As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is growing in importance. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.
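To make the scoring idea concrete, the following is a minimal sketch of how cumulative net income over the 90-day horizon could be compared against a passive baseline. It is not taken from the CoffeeBench code: `DailyRecord`, `cumulative_net_income`, the fixed daily cost, and all numbers are hypothetical placeholders chosen for illustration, and the assumption that an idle firm still incurs a fixed cost is ours, not the paper's.

```python
from dataclasses import dataclass
from typing import List

HORIZON_DAYS = 90          # simulation length stated in the abstract
FIXED_DAILY_COST = 10.0    # hypothetical: assumed cost a firm pays even when idle


@dataclass
class DailyRecord:
    """One day of a firm's books (hypothetical structure)."""
    revenue: float
    costs: float


def cumulative_net_income(records: List[DailyRecord]) -> float:
    """Sum of daily (revenue - costs) over the whole run."""
    return sum(r.revenue - r.costs for r in records)


# Passive baseline: takes no actions, so it earns nothing and (by assumption)
# only pays the fixed daily cost.
passive_run = [DailyRecord(revenue=0.0, costs=FIXED_DAILY_COST)
               for _ in range(HORIZON_DAYS)]

# Toy active roaster: made-up constant daily revenue and costs.
active_run = [DailyRecord(revenue=25.0, costs=20.0)
              for _ in range(HORIZON_DAYS)]

passive_score = cumulative_net_income(passive_run)
active_score = cumulative_net_income(active_run)
beats_baseline = active_score > passive_score
```

Under these toy numbers, the active run beats the passive baseline and ends with positive net income, mirroring the two distinct criteria the abstract reports (outperforming the baseline versus being profitable).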