Tag: Voice. 365 articles archived under #voice. arXiv — NLP / Computation & Language research 6d ago Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English arXiv:2606.23948v1 Announce Type: new Abstract: Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is… 6 arXiv — NLP / Computation & Language research 6d ago Automatic Part-of-Speech Tagging of Arabic-English Dictionary Senses through WordNet arXiv:2606.24359v1 Announce Type: new Abstract: This paper proposed an algorithm for part-of-speech (POS) tagging senses of a bilingual dictionary. The algorithm is applied on the Al-Mawrid Arabic-English dictionary. The tagging task is accomplished by transferring the POS tags… 21 arXiv — NLP / Computation & Language research 6d ago Measuring User's Mental Models of Speech Translation in Human-AI Collaboration arXiv:2606.24644v1 Announce Type: new Abstract: Millions of people use machine translation (MT) tools daily, yet little is known about their perception of what systems can and cannot do. This paper studies users' mental models of speech translation systems through a new… 13 arXiv — NLP / Computation & Language research 6d ago CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation arXiv:2606.24714v1 Announce Type: new Abstract: Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. 
These forms are frequent in real listening… 33 arXiv — NLP / Computation & Language research 6d ago L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models arXiv:2606.24825v1 Announce Type: new Abstract: Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most… 16 arXiv — NLP / Computation & Language research 6d ago Progressive Alignment Objectives for Aligner-Encoder based ASR arXiv:2606.24147v1 Announce Type: cross Abstract: Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without… 23 arXiv — NLP / Computation & Language research 6d ago ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained… 15 Hugging Face official-blog 6d ago Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World Published June 24, 2026 Daniel Gert Nielsen daniel-treble treble-technologies Shivam Saini whojavumusic treble-technologies Alessia Milo alessia-treble… 11 r/LocalLLaMA community 6d ago CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. 
Every audio output scored with UTMOS (utmos22_strong) so quality isn't… 19 llama.cpp releases dev-tools 6d ago b9768 model: Granite Speech Plus (#24818) feat: Add conversion support for Granite Speech Plus Branch: GraniteSpeechPlus AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart feat: Extend granite_speech to support plus multi-layer concatenation… 27 r/MachineLearning community 7d ago Recommendations for speech annotation tools [D] I'm looking for human-in-the-loop platforms that allow you to automatically transcribe audio followed by manually fixing the transcriptions and fine tuning the model. Is there a local (not an online service) installable platform for doing this? … 11 r/MachineLearning community 8d ago Best current methods for finetuning whisper on domain specific vocabulary? [P] Hey everyone, I’m wondering whether there are any newer or more effective methods for fine tuning whisper on domain specific speech. I’m working on a project where the model needs to reliably detect certain specific words and technical terms. The vocabulary and context are… 4 Hacker News — AI on Front Page community 10d ago A new bill takes aim at government pressure to silence lawful online speech Article URL: https://www.eff.org/deeplinks/2026/06/new-bill-takes-aim-government-pressure-silence-lawful-online-speech Comments URL: https://news.ycombinator.com/item?id=48600950 Points: 205 # Comments: 111 27 r/LocalLLaMA community 10d ago How do you guys setup search with your AI models? Been selfhosting my models for a while and I'd really like to integrate Gemma 4 12B as a simple voice assistant with search capabilities. I've tried using openwebui but the search is kind of broken with DDG and I really don't want to use API keys from Brave or Google etc. 
So… 25 r/LocalLLaMA community 10d ago Watching a local AI voice assistant get dumber (A 9B to 0.8B agent experiment on my RTX 5060 Ti) I wanted to find the exact floor for running an intelligent, local voice assistant agent on consumer hardware. I kept the environment, tools, and prompts identical, I stepped the model sizes down through Qwen 3.5 9B, 4B, 2B, and 0.8B to see how agentic reasoning degrades. The… 12 Hugging Face Daily Papers research 10d ago Duration Aware Scheduling for ASR Serving Under Workload Drift Abstract Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by… 26 arXiv — Machine Learning research 11d ago IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for… 35 arXiv — NLP / Computation & Language research 11d ago Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling arXiv:2606.19354v1 Announce Type: new Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. 
A central component of TTS is the… 5 arXiv — NLP / Computation & Language research 11d ago A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization arXiv:2606.19591v1 Announce Type: new Abstract: In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to… 9 arXiv — NLP / Computation & Language research 11d ago Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal arXiv:2606.19910v1 Announce Type: new Abstract: Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised… 31 arXiv — NLP / Computation & Language research 11d ago ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion arXiv:2606.20179v1 Announce Type: new Abstract: Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial… 21 arXiv — NLP / Computation & Language research 11d ago CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges arXiv:2606.20369v1 Announce Type: new Abstract: Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. 
While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats,… 5 arXiv — NLP / Computation & Language research 11d ago Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations arXiv:2606.19951v1 Announce Type: cross Abstract: Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via… 34 arXiv — NLP / Computation & Language research 11d ago Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning arXiv:2606.19996v1 Announce Type: cross Abstract: Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset… 25 arXiv — NLP / Computation & Language research 11d ago PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors arXiv:2606.20137v1 Announce Type: cross Abstract: Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA),… 35 arXiv — NLP / Computation & Language research 11d ago Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. 
We introduce OmniSONAR, a new family of omnilingual, cross-lingual… 7 r/LocalLLaMA community 11d ago My suitcase robot gets high now off a real gas sensor wired straight into the LLM sampler. Smoke raises temperature/top_p/top_k live, so his speech genuinely gets loopier and never repeats. Follow-up on Sparky, my offline suitcase robot I keep overdeveloping. He gets high now, and there's no scripted "stoned mode" anywhere in it. A real MQ-2 gas sensor sits in the case. Every 0.5s I read it against an adaptive clean-air baseline and turn a smoke hit into a 0 to 10… 30 r/MachineLearning community 11d ago Voice debugging at the conversation level seems far more useful than isolated benchmark metrics [D] I have been thinking a lot about how poorly isolated benchmark metrics capture real conversational system quality once models are deployed into multi-turn environments. You can have strong STT scores, decent latency, high task completion rates, and still end up with… 25 arXiv — NLP / Computation & Language research 12d ago Fair Cognitive Impairment Detection Through Unlearning arXiv:2606.18571v1 Announce Type: cross Abstract: Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned… 33 arXiv — NLP / Computation & Language research 12d ago Continuous Audio Thinking for Large Audio Language Models arXiv:2606.18273v1 Announce Type: new Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. 
However, because LALMs are typically trained to produce text-aligned… 37 arXiv — NLP / Computation & Language research 12d ago Montreal Forced Aligner and the state of speech-to-text alignment in 2026 arXiv:2606.18466v1 Announce Type: new Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded… 5 arXiv — NLP / Computation & Language research 12d ago Speech-Driven End-to-End Language Discrimination towards Chinese Dialects arXiv:2606.18584v1 Announce Type: new Abstract: Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of… 12 arXiv — NLP / Computation & Language research 12d ago Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining arXiv:2606.18852v1 Announce Type: new Abstract: Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface… 35 arXiv — NLP / Computation & Language research 12d ago Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies arXiv:2606.18264v1 Announce Type: cross Abstract: Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. 
Classical cascade models that do not explicitly represent the profile, community, and content factors… 8 arXiv — NLP / Computation & Language research 12d ago Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment arXiv:2606.18979v1 Announce Type: cross Abstract: Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but… 36 arXiv — NLP / Computation & Language research 12d ago IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge… 35 arXiv — NLP / Computation & Language research 12d ago Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech arXiv:2506.12311v3 Announce Type: replace Abstract: Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. 
We present a framework for more phonetically… 38 arXiv — NLP / Computation & Language research 12d ago TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving arXiv:2508.07375v3 Announce Type: replace Abstract: Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and… 20 arXiv — NLP / Computation & Language research 12d ago UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition arXiv:2509.14653v2 Announce Type: replace Abstract: This paper proposes a unimodal aggregation (UMA) based nonautoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates acoustic frames (with unimodal weights that… 28 MIT News — AI research 12d ago MIT in the media: For the future of tech, "Massachusetts can absolutely lead" Leaders, faculty across MIT discuss fostering innovation and talent in Greater Boston in special series of articles published alongside the outlet's annual list of 'Tech Power Players' 27 r/LocalLLaMA community 12d ago I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. I’ve been experimenting with how small a usable neural TTS model can realistically get, and I just released Inflect-Nano-v1. As far as I researched (though I could be wrong on this), Inflect-Nano-v1 is the #2 smallest TTS model publicly released (after TinyTTS), and it… 24 arXiv — NLP / Computation & Language research 13d ago MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task arXiv:2606.17255v1 Announce Type: new Abstract: This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. 
Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create… 20 arXiv — NLP / Computation & Language research 13d ago Are you speaking my languages? On spoken language adherence in multimodal LLMs arXiv:2606.17281v1 Announce Type: new Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To… 9 arXiv — NLP / Computation & Language research 13d ago Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation arXiv:2606.17820v1 Announce Type: new Abstract: This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of… 28 arXiv — NLP / Computation & Language research 13d ago When Multiple Scripts Matter: Evaluating ASR in Clinical Settings arXiv:2606.17826v1 Announce Type: new Abstract: Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics… 20 arXiv — NLP / Computation & Language research 13d ago Perceptual compensation for tonal context in self-supervised speech models arXiv:2606.17835v1 Announce Type: new Abstract: This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. 
We conducted a pseudo-replication of a perceptional compensation experiment on Mandarin Chinese tones,… 30 arXiv — NLP / Computation & Language research 13d ago Learning task-specific subspaces via interventional post-training of speech foundation models arXiv:2606.17967v1 Announce Type: new Abstract: Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech… 5 arXiv — NLP / Computation & Language research 13d ago SpeechDx: A Multi-Task Benchmark for Clinical Speech AI arXiv:2606.17339v1 Announce Type: cross Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated… 15 arXiv — NLP / Computation & Language research 13d ago Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition arXiv:2606.17537v1 Announce Type: cross Abstract: Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive decoding, which generates them sequentially from left to right. However, the recognition performance is… 30 arXiv — NLP / Computation & Language research 13d ago ALAS: An Automatic Latent Alignment Score for Audio Language Models arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion… 17 Page 2 of 8 · 365 articles