Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
Authors: Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
Organization: X-LANCE (SJTU Cross Media Language Intelligence Lab)
Published: 2026-05-28
GitHub: https://github.com/InteractiveASR/AgenticASR
Abstract
AI-generated summary: An interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and an interactive simulation system.
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as word error rate (WER) and character error rate (CER) do not adequately reflect such errors. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and a live demo is available at https://i-asr.sjtuxlance.com/.
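To make the contrast between token-level and sentence-level scoring concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact S²ER formula, so the sketch assumes it is the fraction of sentences that a judge marks as not semantically equivalent to the reference. The function names and `toy_judge` are hypothetical, and `toy_judge` is a rule-based stand-in for the LLM-based judge the abstract describes.

```python
from typing import Callable, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def sentence_semantic_error_rate(
    references: Sequence[str],
    hypotheses: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """ASSUMED definition: share of sentences the judge rejects as non-equivalent."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        return 0.0
    errors = sum(0 if judge(r, h) else 1 for r, h in zip(references, hypotheses))
    return errors / len(references)


def toy_judge(reference: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge: equal after dropping filler words."""
    fillers = {"um", "uh", "please"}

    def normalize(text: str) -> list:
        return [w for w in text.lower().split() if w not in fillers]

    return normalize(reference) == normalize(hypothesis)


if __name__ == "__main__":
    refs = ["call doctor lee at nine", "um send the report please"]
    hyps = ["call doctor li at nine", "send the report"]
    for ref, hyp in zip(refs, hyps):
        print(f"WER={word_error_rate(ref, hyp):.2f}  ok={toy_judge(ref, hyp)}")
    print("sentence-level rate:", sentence_semantic_error_rate(refs, hyps, toy_judge))
```

In this toy data, the first hypothesis has the lower WER yet changes a name, so the judge rejects it, while the second has the higher WER but only drops filler words and is accepted. That is the kind of disagreement between token-level and meaning-level scoring that motivates a sentence-level semantic metric.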
arXiv: arxiv.org/abs/2605.29430