Hugging Face Daily Papers · 5 min read

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.29430

Published on May 28 · Submitted by Zixuan Jiang on Jun 8
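Authors: Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen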

Abstract

An interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement. It is evaluated with a new sentence-level semantic error rate metric and an interactive simulation system.

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect this problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
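
To make the contrast between token-level and sentence-level scoring concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formula for the Sentence-level Semantic Error Rate, so the sketch assumes it is the fraction of utterances that a judge marks as semantically different from the reference. The names `corpus_wer`, `sentence_semantic_error_rate`, and `toy_judge` are hypothetical, and `toy_judge` is a rule-based stand-in for the LLM judge the paper describes.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (reference transcript, ASR hypothesis)


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference word
                curr[j - 1] + 1,         # insert a hypothesis word
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Sequence[Pair]) -> float:
    """Standard corpus-level WER: total word edits over total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return edits / max(words, 1)


def sentence_semantic_error_rate(
    pairs: Sequence[Pair], judge: Callable[[str, str], bool]
) -> float:
    """ASSUMED definition: share of sentences the judge calls not equivalent."""
    errors = sum(0 if judge(r, h) else 1 for r, h in pairs)
    return errors / max(len(pairs), 1)


FILLERS = {"um", "uh"}  # hypothetical list of meaning-free tokens


def toy_judge(reference: str, hypothesis: str) -> bool:
    """Stand-in for an LLM judge: equal once fillers and case are ignored."""
    def content(text: str) -> List[str]:
        return [w for w in text.lower().split() if w not in FILLERS]
    return content(reference) == content(hypothesis)


PAIRS: List[Pair] = [
    # One substituted name: small token error, but the meaning changes.
    ("please call doctor Nguyen at nine", "please call doctor Newman at nine"),
    # One dropped filler: same token cost, but the meaning is preserved.
    ("um turn the lights off", "turn the lights off"),
]

print(f"WER  = {corpus_wer(PAIRS):.3f}")
print(f"S2ER = {sentence_semantic_error_rate(PAIRS, toy_judge):.3f}")
```

On these two toy pairs, each hypothesis costs exactly one word edit, so WER is 2/11 and treats both errors alike. The sentence-level rate is 1/2 because only the substituted name changes the meaning. In practice the judge would be an LLM call rather than a rule, which is why the abstract reports a separate human-AI alignment study.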

Get this paper in your agent:

hf papers read 2605.29430
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
