EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Authors: Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara
AI-generated summary
EVA-Bench presents a comprehensive evaluation framework for voice agents that simulates realistic conversations and measures performance across multiple voice-specific failure modes using novel accuracy and experience metrics.
Abstract
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
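The pass@1 / pass@k / pass^k distinction the abstract draws between peak and reliable capability can be sketched in a few lines. This is an illustrative sketch, not code from the EVA-Bench repository: the helper names and the toy data are invented here, and it assumes k independent trials per scenario, each with a binary pass/fail outcome.

```python
from typing import Callable, Sequence


def pass_at_1(trials: Sequence[bool]) -> float:
    """Expected single-attempt success: mean success rate over the k trials."""
    return sum(trials) / len(trials)


def pass_at_k(trials: Sequence[bool]) -> float:
    """1.0 if any of the k trials succeeded: peak capability."""
    return float(any(trials))


def pass_hat_k(trials: Sequence[bool]) -> float:
    """1.0 only if all k trials succeeded: reliable capability."""
    return float(all(trials))


def aggregate(per_scenario: Sequence[Sequence[bool]],
              metric: Callable[[Sequence[bool]], float]) -> float:
    """Mean of a per-scenario metric across the benchmark."""
    return sum(metric(t) for t in per_scenario) / len(per_scenario)


# Toy benchmark: 3 scenarios, k = 4 trials each.
runs = [
    [True, True, True, True],      # reliably solved
    [True, False, False, True],    # solved sometimes
    [False, False, False, False],  # never solved
]
peak = aggregate(runs, pass_at_k)       # 2/3: two scenarios solved at least once
reliable = aggregate(runs, pass_hat_k)  # 1/3: one scenario solved every time
gap = peak - reliable                   # the pass@k - pass^k gap
```

A large gap, as reported in the paper (median 0.44 on EVA-A), means a system can often solve a scenario at least once in k tries but rarely solves it every time.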
Community
How do you know a voice agent is good? Task completion isn't enough. A voice agent can call the correct tools and still misread a confirmation code, fabricate a policy detail, or respond so slowly a caller hangs up. Catching those failures requires evaluation that goes beyond transcripts — and beyond a single domain or acoustic condition.
Today, we're releasing 𝗘𝗩𝗔-𝗕𝗲𝗻𝗰𝗵 — designed to surface exactly that.
🏢 𝗧𝗵𝗿𝗲𝗲 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲 𝗱𝗼𝗺𝗮𝗶𝗻𝘀. We've scaled from a single dataset to three: 𝗛𝗥, 𝗜𝗧𝗦𝗠, and 𝗖𝗦𝗠. Because the best voice agent for customer service isn't necessarily the best one for HR or IT support.
If you prefer the video/audio modality, check out the podcast about this work: https://www.youtube.com/watch?v=x7Ks932T18o
arXiv: arxiv.org/abs/2605.13841