Hugging Face Daily Papers · 14 min read

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Yipeng Ouyang, Xin Huang, Bingjie Liu, Zhongchun Zheng, Yuhao Gu, Xianwei Zhang
Organization: Sun Yat-Sen University

Code: https://github.com/Nexa-Language/RAMP
Project page and leaderboard: http://ramp.yatcc-ai.com/

arxiv:2605.27492

Published on May 26 · Submitted by Ouyang Yipeng on Jun 4

Abstract

Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis.

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.

Community

🚀 Benchmarks are Not Enough: RAMP for Evaluating Agentic Models in Real-World Production Systems

Hi Hugging Face Community! 👋 We are excited to share our latest work that challenges the current paradigm of LLM agent evaluation benchmarks.

While Large Language Models (LLMs) are rapidly evolving into autonomous software engineering systems, existing evaluation methodologies are largely centered on static, isolated, and short-horizon benchmarks. We found that high scores on traditional benchmarks poorly reflect practical capabilities under realistic runtime environments that involve long execution chains, tool interactions, and iterative feedback loops.

To bridge this gap, we introduce RAMP (Runtime Assessment of Models in Production), an infrastructure designed to evaluate agents in continuous, stateful, and resource-constrained engineering workflows.

How Far Can Today’s Agentic Models Go on the Ramp to Real System Development?

🛠️ The Workload: Real-World Compiler Construction

Instead of isolated coding puzzles, RAMP evaluates agents on a 6-stage compiler-construction pipeline (based on YatCC). The tasks range from environment setup (T0) and lexer generation (T1) all the way to LLVM IR optimization (T4) and RV64 assembly generation (T5). Each task consumes the output artifact of its predecessor, creating a strict serial dependency chain.

✨ Core Innovation: The "Resurrection Protocol"

In long-horizon tasks, a failure at an early stage usually invalidates all downstream steps, obscuring the model's true capabilities. To solve this, RAMP introduces a Resurrection Protocol.

When an agent fails an intermediate task, the orchestrator automatically and transparently injects a "golden artifact" (a perfect intermediate state) and lets the agent continue. This allows us to separate "cannot reach" from "cannot solve," so each stage can be diagnosed independently of upstream failures.
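
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not RAMP's actual orchestrator: the names `run_pipeline`, `StageResult`, `reached_unaided`, the `solve` callback, and the `golden` dictionary are all hypothetical, and the stage labels T0 to T5 are used only as opaque identifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Six serial stages, as described in the post; labels are opaque identifiers here.
STAGES = ["T0", "T1", "T2", "T3", "T4", "T5"]


@dataclass
class StageResult:
    stage: str
    passed: bool        # did the agent produce a valid artifact for this stage?
    golden_input: bool  # was this stage's direct input a golden artifact?


# Hypothetical agent interface: returns an artifact, or None on failure.
Solver = Callable[[str, Optional[str]], Optional[str]]


def run_pipeline(solve: Solver, golden: Dict[str, str]) -> List[StageResult]:
    """Run every stage; on failure, inject the golden artifact and continue."""
    results: List[StageResult] = []
    upstream: Optional[str] = None
    golden_input = False
    for stage in STAGES:
        artifact = solve(stage, upstream)
        passed = artifact is not None
        results.append(StageResult(stage, passed, golden_input))
        if passed:
            upstream, golden_input = artifact, False
        else:
            # "Resurrection": replace the missing artifact with a known-good one.
            upstream, golden_input = golden[stage], True
    return results


def reached_unaided(results: List[StageResult]) -> List[str]:
    """Stages the agent reached before its first failure (no golden help)."""
    reached: List[str] = []
    for r in results:
        reached.append(r.stage)
        if not r.passed:
            break
    return reached


# Toy agent (made up for illustration) that fails at T2 and T4.
def toy_solve(stage: str, upstream: Optional[str]) -> Optional[str]:
    return None if stage in {"T2", "T4"} else f"{stage}-artifact"


golden = {s: f"{s}-golden" for s in STAGES}
results = run_pipeline(toy_solve, golden)
for r in results:
    print(r)
print("reached unaided:", reached_unaided(results))
```

In this toy run the agent would have stopped at T2 without intervention ("cannot reach" for T3 onward), while the failure at T4 after a resurrected T3 shows a stage it "cannot solve" even when given a chance to continue.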

Figure 3: The RAMP Pipeline demonstrating Serial Evolution and the Resurrection Protocol.

📊 Shocking Findings from 15 SOTA Models

We evaluated 15 models (including Opus-4.7, GPT-5.5, DeepSeek-v4-Pro, and Qwen-3.6-Max). The results were eye-opening:

  • A Clear Capability Ceiling: None of the 15 evaluated models successfully completed the entire pipeline. Even the top-performing model, Opus-4.7, stalled at the IR Generation stage.
  • The 2525x Efficiency Gap: Process efficiency varied wildly. Total inference costs ranged from $0.05 (Qwen3-Coder) to $126.24 (Opus-4.7) — an extreme 2525x difference.
  • The "Context" Killer: We mapped a detailed failure taxonomy and discovered that Context Failure is the most prevalent hard-stop reason (accounting for 60.0% of failures), predominantly occurring in the middle stages (T2-T3) as history and code artifacts accumulate.

Figure 5: Trade-off of Cost and Performance: Elapsed time and API cost versus mean reward.
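
As a quick check on the efficiency gap, the ratio can be recomputed from the two cost figures quoted in the findings. The variable names are illustrative; only the two dollar amounts come from the post.

```python
import math

cheapest_usd = 0.05     # Qwen3-Coder, as quoted in the post
priciest_usd = 126.24   # Opus-4.7, as quoted in the post

ratio = priciest_usd / cheapest_usd
orders_of_magnitude = math.log10(ratio)

print(f"{ratio:.1f}x, about 10^{orders_of_magnitude:.1f}")
```

The ratio rounds to the 2525x quoted in the findings, and its base-10 logarithm lies between 3 and 4, which agrees with the abstract's "up to three orders of magnitude".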

⚖️ Beyond Accuracy: The Agent Efficiency Index (AEI)

In production, a model that brute-forces a solution using massive context and time isn't always the best choice. We propose the Agent Efficiency Index (AEI), a composite metric jointly measuring task effectiveness, time, cost, and token utilization.

Under AEI, the rankings flip: GPT-5.5 achieved the highest composite efficiency (AEI 81.57), whereas Opus-4.7, despite having the highest raw task reward, dropped to an AEI of 40.00 due to massive resource overhead.
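
The post does not give the AEI formula, so the sketch below is not the paper's definition. It only illustrates how a composite of this kind can reorder models, assuming min-max normalization within the cohort and equal weights. The function `efficiency_index`, the `Run` fields, the weights, and both example runs are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Run:
    name: str
    reward: float   # task effectiveness in [0, 1]
    seconds: float  # elapsed time
    usd: float      # API cost
    tokens: int     # total tokens consumed


def lower_is_better(value: float, values: Sequence[float]) -> float:
    """Min-max normalize so the cheapest run scores 1 and the costliest 0."""
    lo, hi = min(values), max(values)
    return 1.0 if hi == lo else (hi - value) / (hi - lo)


def efficiency_index(
    runs: List[Run],
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> Dict[str, float]:
    """Hypothetical composite on a 0-100 scale; NOT the paper's AEI formula."""
    w_reward, w_time, w_cost, w_tokens = weights
    secs = [r.seconds for r in runs]
    usds = [r.usd for r in runs]
    toks = [float(r.tokens) for r in runs]
    return {
        r.name: 100.0 * (
            w_reward * r.reward
            + w_time * lower_is_better(r.seconds, secs)
            + w_cost * lower_is_better(r.usd, usds)
            + w_tokens * lower_is_better(float(r.tokens), toks)
        )
        for r in runs
    }


# Made-up runs: "A" has the higher reward, "B" is far cheaper and faster.
runs = [
    Run("A", reward=0.9, seconds=7200.0, usd=120.0, tokens=50_000_000),
    Run("B", reward=0.7, seconds=1800.0, usd=10.0, tokens=5_000_000),
]
scores = efficiency_index(runs)
print(scores)
```

Under these assumed weights, the run with the lower raw reward ranks first once time, cost, and tokens are counted, which is the kind of ranking flip described above.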

Read the full paper to explore the deep diagnostics of model behavior and why we need to move past static benchmarks!

The workload is based on the YatCC compiler-construction pipeline, a widely used engineering practice project. Beyond that, the YatCC Platform provides more intelligent services.
More information: https://yatcc-ai.com


