Tag

Voice

365 articles archived under #voice · RSS

arXiv — NLP / Computation & Language research 30m ago

Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain

arXiv:2606.28772v1 Announce Type: new Abstract: Hate speech annotation pipelines routinely collapse annotator disagreement into majority vote labels before training. We show that this aggregation is not neutral: 42.6% of all annotator disagreement in HateXplain concentrates…

28
arXiv — NLP / Computation & Language research 30m ago

How to Leverage Synthetic Speech for LLM-Based ASR Systems?

arXiv:2606.29031v1 Announce Type: new Abstract: In regulated domains such as banking and healthcare, where privacy constraints make real speech costly to collect and retain, synthetic speech from modern text-to-speech (TTS) is an appealing alternative for training automatic…

15
arXiv — NLP / Computation & Language research 30m ago

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

arXiv:2606.29534v1 Announce Type: new Abstract: Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a…

23
r/MachineLearning community 12h ago

I'm trying to implement CALM paper, and I have some questions. [P]

Hello, I'm trying to implement Pocket TTS by kyutai-labs, described in this paper. Since they haven't released the training/fine-tuning code, I'm trying to implement it on my own to learn some things. I have read the paper, tried to implement it with much more…

34
Vercel — AI dev-tools 21h ago

Build realtime voice agents on AI Gateway

AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI. Each…

26
arXiv — Machine Learning research 1d ago

HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models

arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance…

7
arXiv — Machine Learning research 1d ago

What Was That Again? Certified Robustness for Automatic Speech Recognition

arXiv:2606.27698v1 Announce Type: new Abstract: Automatic Speech Recognition systems are notoriously sensitive to both adversarial and benign perturbations. While this has been repeatedly demonstrated using reference datasets, detecting such behaviors in deployed systems is…

19
arXiv — Machine Learning research 1d ago

Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition

arXiv:2606.27536v1 Announce Type: cross Abstract: Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for…

23
arXiv — Machine Learning research 1d ago

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

arXiv:2606.27543v1 Announce Type: cross Abstract: The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is…

37
arXiv — NLP / Computation & Language research 1d ago

A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges

arXiv:2606.27380v1 Announce Type: new Abstract: Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing…

6
arXiv — NLP / Computation & Language research 1d ago

Do Speech Emphasis Models Generalize across Languages and Emotions?

arXiv:2606.27717v1 Announce Type: new Abstract: Prosodic emphasis varies across languages, emotions, and speaking styles, yet existing emphasis detection models are largely trained and evaluated on monolingual neutral read speech. We introduce MMEE (Multilingual Multi-Emotion…

12
arXiv — NLP / Computation & Language research 1d ago

From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection

arXiv:2606.27973v1 Announce Type: new Abstract: Speech-based cognitive impairment detection offers a noninvasive, accessible alternative to costly biomarker assays, yet transformer-based models remain clinically uninterpretable. We propose a multi-stage explainability framework…

23
arXiv — NLP / Computation & Language research 1d ago

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

arXiv:2606.28249v1 Announce Type: cross Abstract: Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting…

20
arXiv — NLP / Computation & Language research 1d ago

Measuring the Redundancy of Decoder Layers in SpeechLLMs

arXiv:2603.05121v2 Announce Type: replace Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks.…

36
Hacker News — AI on Front Page community 1d ago

Age verification is just a precursor to automated attribution of speech

Article URL: https://nonogra.ph/age-verification-is-just-a-precursor-to-attribution-of-speech-06-29-2026 Comments URL: https://news.ycombinator.com/item?id=48714529 Points: 238 # Comments: 105

34
Vercel — AI dev-tools 1d ago

Realtime voice, speech, and transcription now supported on AI Gateway

AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text. This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with…

17
Vercel — AI dev-tools 1d ago

xAI Grok audio models now available on Vercel AI Gateway

xAI's audio models are now live on AI Gateway. Realtime voice, text to speech, and speech to text are all available through the AI SDK with the same routing, observability, and spend controls as your other models. These capabilities are available on the AI SDK 7 release.…

11
Hacker News — AI on Front Page community 1d ago

30-year sentence for transporting zines is a five-alarm fire for free speech

Article URL: https://theintercept.com/2026/06/26/daniel-sanchez-estrada-zines-prairieland-free-speech/ Comments URL: https://news.ycombinator.com/item?id=48711981 Points: 200 # Comments: 111

20
r/LocalLLaMA community 1d ago

Whisperian: It is one of the best applications for Android, if you want to use Mic with some local ASR models. And it is also available on Play Store.

submitted by /u/9r4n4y

29
r/MachineLearning community 2d ago

NagaTranslate: Building a translation and voice pipeline for low-resource Nagaland creoles (Whisper, VITS, LLMs) [P]

Hello r/MachineLearning, I wanted to share the architecture and challenges behind a project I’ve been building called NagaTranslate. The goal is to build a translation and speech pipeline for the low-resource languages of Nagaland, India (currently supporting Nagamese, Ao, and…

30
r/LocalLLaMA community 2d ago

Agentic Cyberdeck Dev

I developed this around August '25, but never had real polished panels. So, here we are with some decent panels, and new speakers for voice AI inferencing. This has local agentic GPS, chat, voice, vision analysis. This is a fun little project that I come back around to until I…

12
r/LocalLLaMA community 2d ago

Are there any qwen finetunes that were genuinely stronger than the base?

It's pretty popular to finetune qwen models but I never hear anyone say anything positive about them. submitted by /u/MrMrsPotts

30
r/LocalLLaMA community 3d ago

Streaming medical STT running locally on a MacBook

Quick teaser of what I’ve been working on over the last few weeks: a streaming medical speech-to-text model that runs fully on-device. This demo is running locally on a MacBook through MLX. Still doing more evals, but planning to release the open weights next week.  …

22
arXiv — NLP / Computation & Language research 4d ago

Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars

arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a…

37
arXiv — NLP / Computation & Language research 4d ago

AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification

arXiv:2606.26452v1 Announce Type: new Abstract: To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but…

31
arXiv — NLP / Computation & Language research 4d ago

Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean

arXiv:2606.26618v1 Announce Type: new Abstract: Large pretrained text-to-speech (TTS) models sound almost human for well-resourced languages, but much worse for languages that are rare in their training data. We study this quality gap for Khmer and Korean using VoxCPM2, a…

26
arXiv — NLP / Computation & Language research 4d ago

FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following

arXiv:2606.26819v1 Announce Type: new Abstract: This paper describes our submission to the IWSLT 2026 Instruction Following shared task. SpeechLLMs are developed for both short-form and long-form speech instruction following under constrained settings. For the short track,…

14
arXiv — NLP / Computation & Language research 4d ago

SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages

arXiv:2606.26901v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare context remains largely unknown. In this study, we first…

6
arXiv — NLP / Computation & Language research 4d ago

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

arXiv:2606.26968v1 Announce Type: new Abstract: Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting…

35
arXiv — NLP / Computation & Language research 4d ago

Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech

arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While…

36
r/LocalLLaMA community 4d ago

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

I’ve been working on audio.cpp, a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything…

24
Hugging Face Daily Papers research 4d ago

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

Abstract A novel speaker verification framework combines frozen self-supervised features with ECAPA-TDNN and MoE modules to improve identity verification across both speech and non-verbal vocalizations while maintaining speech performance. Generated by…

30
r/MachineLearning community 4d ago

Looking for arXiv endorsement (eess.AS or cs.SD) [R]

Hi, I'm an undergrad researcher looking for an arXiv endorsement to submit my first paper in the audio/speech processing domain (keyword spotting on microcontrollers). I've submitted to a peer-reviewed IEEE conference and am awaiting results, but want to get a preprint up in the…

26
r/LocalLLaMA community 4d ago

Has anyone tried to hack into their own system using a local model?

With all this talk about Mythos being able to hack into US government systems, I was wondering if anyone has tried to get root on their own system using a local model? submitted by /u/MrMrsPotts

18
arXiv — NLP / Computation & Language research 5d ago

Graph-Based Phonetic Error Correction of Noisy ASR

arXiv:2606.24889v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing…

37
arXiv — NLP / Computation & Language research 5d ago

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

arXiv:2606.24915v1 Announce Type: new Abstract: End-to-end automatic speech recognition systems frequently hallucinate rare entities and domain-specific terms, especially in low-resource languages. While retrieval-augmented generation frameworks can mitigate these errors using…

18
arXiv — NLP / Computation & Language research 5d ago

Probing in the Wild: A Case Study of Self-Supervised Speech Representations on Mandarin Sub-dialects with Unsupervised Articulatory Analysis

arXiv:2606.25459v1 Announce Type: new Abstract: While self-supervised speech models have achieved strong performance across speech tasks, relatively little is known about how their internal phonetic representations behave under fine-grained dialect variation. Existing probing…

11
arXiv — NLP / Computation & Language research 5d ago

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat…

23
arXiv — NLP / Computation & Language research 5d ago

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations…

29
arXiv — NLP / Computation & Language research 5d ago

Dziri Voicebot: An End-to-End Low-Resource Speech-to-Speech Conversational System for Algerian Dialect

arXiv:2606.26003v1 Announce Type: new Abstract: Automatic speech and language technologies are still heavily biased toward high-resource languages, limiting their applicability to dialectal and low-resource settings such as Algerian Dialect. This language presents additional…

28
arXiv — NLP / Computation & Language research 5d ago

Real-Time Voice AI Hears but Does Not Listen

arXiv:2606.26083v1 Announce Type: new Abstract: Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on…

34
arXiv — NLP / Computation & Language research 5d ago

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

arXiv:2606.25369v1 Announce Type: cross Abstract: While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique…

36
arXiv — NLP / Computation & Language research 5d ago

Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS

arXiv:2606.25424v1 Announce Type: cross Abstract: Diffusion-based text-to-speech (TTS) models have achieved significant improvements in speech quality. However, modeling sharp prosodic transitions and rapid pitch variations in expressive speech remains challenging. Existing…

37
arXiv — NLP / Computation & Language research 5d ago

Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models

arXiv:2606.25436v1 Announce Type: cross Abstract: Dialogue systems based on large language models (LLMs) have advanced significantly in recent years. However, dialectal variation remains a major challenge, particularly for systems that process spoken input. LLM-based speech…

34
arXiv — NLP / Computation & Language research 5d ago

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on…

23
arXiv — NLP / Computation & Language research 5d ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced…

24
arXiv — NLP / Computation & Language research 5d ago

Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme-Based Analysis of Climate Discourse

arXiv:2601.13317v2 Announce Type: replace Abstract: Climate discourse online shapes public understanding of climate change and informs political and policy debate, yet it unfolds across structurally different environments: paid advertising platforms host targeted,…

9
Hacker News — AI on Front Page community 5d ago

Founding a company in Germany: €9600, 152 days and I still can't send an invoice

Article URL: https://paolino.me/founding-a-company-in-germany/ Comments URL: https://news.ycombinator.com/item?id=48658718 Points: 282 # Comments: 334

10
r/LocalLLaMA community 5d ago

llama.cpp updates - granite-speech-4.1-2b, LFM2.5-ColBERT/Embedding-350M, Vulkan backend related changes & Misc items

Supported Models: granite-speech-4.1-2b-plus by 24818; LFM2.5-ColBERT-350M & LFM2.5-Embedding-350M by 24913. Vulkan: vulkan: link ggml-cpu when GGML_VULKAN_CHECK_RESULTS / RUN_TESTS are enabled #24444; vulkan: make mul_mm ALIGNED a spec constant #24689; vulkan: support CONV_3D…

27
arXiv — Machine Learning research 6d ago

NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction

arXiv:2606.24087v1 Announce Type: new Abstract: Reconstructing continuous speech from scalp electroencephalography (EEG) remains fundamentally challenging. EEG provides a weak, spatially diffuse, and highly variable measurement of distributed cortical activity, whereas speech is…

9
