Voice: 365 articles archived under #voice
arXiv — NLP / Computation & Language research 30m ago Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain arXiv:2606.28772v1 Announce Type: new Abstract: Hate speech annotation pipelines routinely collapse annotator disagreement into majority vote labels before training. We show that this aggregation is not neutral: 42.6% of all annotator disagreement in HateXplain concentrates… 28 arXiv — NLP / Computation & Language research 30m ago How to Leverage Synthetic Speech for LLM-Based ASR Systems? arXiv:2606.29031v1 Announce Type: new Abstract: In regulated domains such as banking and healthcare, where privacy constraints make real speech costly to collect and retain, synthetic speech from modern text-to-speech (TTS) is an appealing alternative for training automatic… 15 arXiv — NLP / Computation & Language research 30m ago Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs arXiv:2606.29534v1 Announce Type: new Abstract: Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a… 23 r/MachineLearning community 12h ago I'm trying to implement CALM paper, and I have some questions. [P] Hello, I'm trying to implement the Pocket TTS by kyutai-labs represented by this paper. Since they haven't released the training/fine-tuning code, I'm trying to implement it on my own for learning some stuff. I have read the paper, tried to implement it with much more… 34 Vercel — AI dev-tools 21h ago Build realtime voice agents on AI Gateway AI Gateway now supports audio/voice.
You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI . Each… 26 arXiv — Machine Learning research 1d ago HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance… 7 arXiv — Machine Learning research 1d ago What Was That Again? Certified Robustness for Automatic Speech Recognition arXiv:2606.27698v1 Announce Type: new Abstract: Automatic Speech Recognition systems are notoriously both sensitive to adversarial and benign perturbations. While this has been repeatedly demonstrated using reference datasets, detecting such behaviors in deployed systems is… 19 arXiv — Machine Learning research 1d ago Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition arXiv:2606.27536v1 Announce Type: cross Abstract: Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for… 23 arXiv — Machine Learning research 1d ago Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings arXiv:2606.27543v1 Announce Type: cross Abstract: The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. 
Classification is… 37 arXiv — NLP / Computation & Language research 1d ago A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges arXiv:2606.27380v1 Announce Type: new Abstract: Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing… 6 arXiv — NLP / Computation & Language research 1d ago Do Speech Emphasis Models Generalize across Languages and Emotions? arXiv:2606.27717v1 Announce Type: new Abstract: Prosodic emphasis varies across languages, emotions, and speaking styles, yet existing emphasis detection models are largely trained and evaluated on monolingual neutral read speech. We introduce MMEE (Multilingual Multi-Emotion… 12 arXiv — NLP / Computation & Language research 1d ago From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection arXiv:2606.27973v1 Announce Type: new Abstract: Speech-based cognitive impairment detection offers a noninvasive, accessible alternative to costly biomarker assays, yet transformer-based models remain clinically uninterpretable. We propose a multi-stage explainability framework… 23 arXiv — NLP / Computation & Language research 1d ago HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech arXiv:2606.28249v1 Announce Type: cross Abstract: Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. 
However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting… 20 arXiv — NLP / Computation & Language research 1d ago Measuring the Redundancy of Decoder Layers in SpeechLLMs arXiv:2603.05121v2 Announce Type: replace Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks.… 36 Hacker News — AI on Front Page community 1d ago Age verification is just a precursor to automated attribution of speech Article URL: https://nonogra.ph/age-verification-is-just-a-precursor-to-attribution-of-speech-06-29-2026 Comments URL: https://news.ycombinator.com/item?id=48714529 Points: 238 # Comments: 105 34 Vercel — AI dev-tools 1d ago Realtime voice, speech, and transcription now supported on AI Gateway AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text. This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with… 17 Vercel — AI dev-tools 1d ago xAI Grok audio models now available on Vercel AI Gateway xAI's audio models are now live on AI Gateway. Realtime voice, text to speech, and speech to text are all available through the AI SDK with the same routing, observability, and spend controls as your other models. 
These capabilities are available on the AI SDK 7 release.… 11 Hacker News — AI on Front Page community 1d ago 30-year sentence for transporting zines is a five-alarm fire for free speech Article URL: https://theintercept.com/2026/06/26/daniel-sanchez-estrada-zines-prairieland-free-speech/ Comments URL: https://news.ycombinator.com/item?id=48711981 Points: 200 # Comments: 111 20 r/LocalLLaMA community 1d ago Whisperian: It is one of the best applications for Android, if you want to use Mic with some local ASR models. And it is also available on Play Store. (submitted by /u/9r4n4y) 29 r/MachineLearning community 2d ago NagaTranslate: Building a translation and voice pipeline for low-resource Nagaland creoles (Whisper, VITS, LLMs) [P] Hello r/MachineLearning, I wanted to share the architecture and challenges behind a project I’ve been building called NagaTranslate. The goal is to build a translation and speech pipeline for the low-resource languages of Nagaland, India (currently supporting Nagamese, Ao, and… 30 r/LocalLLaMA community 2d ago Agentic Cyberdeck Dev I developed this around August '25, but never had real polished panels. So, here we are with some decent panels, and new speakers for voice AI inferencing. This has local agentic GPS, chat, voice, vision analysis. This is a fun little project that I come back around to until I… 12 r/LocalLLaMA community 2d ago Are there any qwen finetunes that were genuinely stronger than the base? It's pretty popular to finetune qwen models but I never hear anyone say anything positive about them. (submitted by /u/MrMrsPotts) 30 r/LocalLLaMA community 3d ago Streaming medical STT running locally on a MacBook Quick teaser of what I’ve been working on over the last few weeks: a streaming medical speech-to-text model that runs fully on-device. This demo is running locally on a MacBook through MLX. Still doing more evals, but planning to release the open weights next week.
… 22 arXiv — NLP / Computation & Language research 4d ago Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems, that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a… 37 arXiv — NLP / Computation & Language research 4d ago AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification arXiv:2606.26452v1 Announce Type: new Abstract: To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but… 31 arXiv — NLP / Computation & Language research 4d ago Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean arXiv:2606.26618v1 Announce Type: new Abstract: Large pretrained text-to-speech (TTS) models sound almost human for well-resourced languages, but much worse for languages that are rare in their training data. We study this quality gap for Khmer and Korean using VoxCPM2, a… 26 arXiv — NLP / Computation & Language research 4d ago FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following arXiv:2606.26819v1 Announce Type: new Abstract: This paper describes our submission to the IWSLT 2026 Instruction Following shared task. SpeechLLMs are developed for both short-form and long-form speech instruction following under constrained settings. 
For the short track,… 14 arXiv — NLP / Computation & Language research 4d ago SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages arXiv:2606.26901v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare context remains largely unknown. In this study, we first… 6 arXiv — NLP / Computation & Language research 4d ago RedVox: Safety and Fairness Gaps in Speech Models Across Languages arXiv:2606.26968v1 Announce Type: new Abstract: Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting… 35 arXiv — NLP / Computation & Language research 4d ago Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While… 36 r/LocalLLaMA community 4d ago audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA I’ve been working on audio.cpp , a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. 
I’m not counting anything… 24 Hugging Face Daily Papers research 4d ago Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach Abstract: A novel speaker verification framework combines frozen self-supervised features with ECAPA-TDNN and MoE modules to improve identity verification across both speech and non-verbal vocalizations while maintaining speech performance. Generated by… 30 r/MachineLearning community 4d ago Looking for arXiv endorsement (eess.AS or cs.SD) [R] Hi, I'm an undergrad researcher looking for an arXiv endorsement to submit my first paper in the audio/speech processing domain (keyword spotting on microcontrollers). I've submitted to a peer-reviewed IEEE conference and am awaiting results, but want to get a preprint up in the… 26 r/LocalLLaMA community 4d ago Has anyone tried to hack into their own system using a local model? With all this talk about Mythos being able to hack into US government systems, I was wondering if anyone has tried to get root on their own system using a local model? (submitted by /u/MrMrsPotts) 18 arXiv — NLP / Computation & Language research 5d ago Graph-Based Phonetic Error Correction of Noisy ASR arXiv:2606.24889v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing… 37 arXiv — NLP / Computation & Language research 5d ago Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction arXiv:2606.24915v1 Announce Type: new Abstract: End-to-end automatic speech recognition systems frequently hallucinate rare entities and domain-specific terms, especially in low-resource languages.
While retrieval-augmented generation frameworks can mitigate these errors using… 18 arXiv — NLP / Computation & Language research 5d ago Probing in the Wild: A Case Study of Self-Supervised Speech Representations on Mandarin Sub-dialects with Unsupervised Articulatory Analysis arXiv:2606.25459v1 Announce Type: new Abstract: While self-supervised speech models have achieved strong performance across speech tasks, relatively little is known about how their internal phonetic representations behave under fine-grained dialect variation. Existing probing… 11 arXiv — NLP / Computation & Language research 5d ago How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat… 23 arXiv — NLP / Computation & Language research 5d ago SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations… 29 arXiv — NLP / Computation & Language research 5d ago Dziri Voicebot: An End-to-End Low-Resource Speech-to-Speech Conversational System for Algerian Dialect arXiv:2606.26003v1 Announce Type: new Abstract: Automatic speech and language technologies are still heavily biased toward high-resource languages, limiting their applicability to dialectal and low-resource settings such as Algerian Dialect. 
This language presents additional… 28 arXiv — NLP / Computation & Language research 5d ago Real-Time Voice AI Hears but Does Not Listen arXiv:2606.26083v1 Announce Type: new Abstract: Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems-OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash-on… 34 arXiv — NLP / Computation & Language research 5d ago Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis arXiv:2606.25369v1 Announce Type: cross Abstract: While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique… 36 arXiv — NLP / Computation & Language research 5d ago Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS arXiv:2606.25424v1 Announce Type: cross Abstract: Diffusion-based text-to-speech (TTS) models have achieved significant improvements in speech quality. However, modeling sharp prosodic transitions and rapid pitch variations in expressive speech remains challenging. Existing… 37 arXiv — NLP / Computation & Language research 5d ago Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models arXiv:2606.25436v1 Announce Type: cross Abstract: Dialogue systems based on large language models (LLMs) have advanced significantly in recent years. However, dialectal variation remains a major challenge, particularly for systems that process spoken input. LLM-based speech… 34 arXiv — NLP / Computation & Language research 5d ago Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs? 
arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on… 23 arXiv — NLP / Computation & Language research 5d ago Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced… 24 arXiv — NLP / Computation & Language research 5d ago Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme-Based Analysis of Climate Discourse arXiv:2601.13317v2 Announce Type: replace Abstract: Climate discourse online shapes public understanding of climate change and informs political and policy debate, yet it unfolds across structurally different environments: paid advertising platforms host targeted,… 9 Hacker News — AI on Front Page community 5d ago Founding a company in Germany: €9600, 152 days and I still can't send an invoice Article URL: https://paolino.me/founding-a-company-in-germany/ Comments URL: https://news.ycombinator.com/item?id=48658718 Points: 282 # Comments: 334 10 r/LocalLLaMA community 5d ago llama.cpp updates - granite-speech-4.1-2b, LFM2.5-ColBERT/Embedding-350M, Vulkan backend related changes & Misc items Supported Models : granite-speech-4.1-2b-plus by 24818 LFM2.5-ColBERT-350M & LFM2.5-Embedding-350M by 24913 Vulkan : vulkan: link ggml-cpu when GGML_VULKAN_CHECK_RESULTS / RUN_TESTS are enabled #24444 vulkan: make mul_mm ALIGNED a spec constant #24689 vulkan: support CONV_3D… 27 arXiv — Machine Learning research 6d ago NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction arXiv:2606.24087v1 Announce Type: new Abstract: Reconstructing 
continuous speech from scalp electroencephalography (EEG) remains fundamentally challenging. EEG provides a weak, spatially diffuse, and highly variable measurement of distributed cortical activity, whereas speech is… 9