Tag

Voice

365 articles archived under #voice

arXiv — NLP / Computation & Language research 14d ago

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior…

7
arXiv — NLP / Computation & Language research 14d ago

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

arXiv:2606.15266v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains…

16
arXiv — NLP / Computation & Language research 14d ago

Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

arXiv:2606.15325v1 Announce Type: new Abstract: Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in…

4
arXiv — NLP / Computation & Language research 14d ago

ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

arXiv:2606.15984v1 Announce Type: new Abstract: Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the…

4
arXiv — NLP / Computation & Language research 14d ago

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

arXiv:2606.16009v1 Announce Type: new Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains…

23
arXiv — NLP / Computation & Language research 14d ago

Scaling Human and G2P Supervision for Robust Phonetic Transcription

arXiv:2606.16019v1 Announce Type: new Abstract: Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We…

20
arXiv — NLP / Computation & Language research 14d ago

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

arXiv:2606.16074v1 Announce Type: new Abstract: Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior…

38
arXiv — NLP / Computation & Language research 14d ago

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

arXiv:2606.16137v1 Announce Type: new Abstract: Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation ways mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution,…

31
arXiv — NLP / Computation & Language research 14d ago

TMASC: Transmasculine Attitude and Speech Corpus

arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the…

25
arXiv — Machine Learning research 15d ago

Beyond task performance: Decoding bioacoustic embeddings with speech features

arXiv:2606.14662v1 Announce Type: new Abstract: Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species…

6
arXiv — NLP / Computation & Language research 15d ago

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

arXiv:2606.14391v1 Announce Type: new Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior…

15
arXiv — NLP / Computation & Language research 15d ago

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

arXiv:2606.14459v1 Announce Type: new Abstract: Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech…

6
arXiv — NLP / Computation & Language research 15d ago

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

arXiv:2606.14528v1 Announce Type: new Abstract: Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in.…

10
arXiv — NLP / Computation & Language research 15d ago

Multimodal Speaker Identification in Classroom Environments

arXiv:2606.13712v1 Announce Type: cross Abstract: Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework…

24
arXiv — NLP / Computation & Language research 15d ago

OLaPh: Optimal Language Phonemizer

arXiv:2509.20086v4 Announce Type: replace Abstract: Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary…

8
arXiv — NLP / Computation & Language research 15d ago

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

arXiv:2510.05150v3 Announce Type: replace Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This…

7
r/LocalLLaMA community 15d ago

Gemma 12b less than 10 watts 6.5pp 1.3tg

Google pixel 10 pro Termux Llamacpp version: 9639 (ef8268fee) $ ./llama.cpp/build_vulkan/bin/llama-cli -m storage/downloads/gemma-4-12b-it-UD-Q3_K_XL.gguf --model-draft storage/downloads/mtp-gemma-4-12b-it.gguf --temp 1.0 --top-p 0.95 --top-k 64 --spec-type draft-mtp…

5
r/LocalLLaMA community 15d ago

Voice-to-voice chatbot update

I've been working on this after hours for a few months continuously improving it. Now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B…

33
r/LocalLLaMA community 15d ago

Gemma 4 models benchmarked on with Triple GPU

Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 Mhz RAM. Nvidia GTX-1070 at 8GiB VRAM ( X 3 ) with 24GiB total VRAM. GPUs have power limit set to 120, 121, 122 watts using: sudo…

29
r/LocalLLaMA community 15d ago

Gemma 4 12B native encoder free voice input utilization suggest?

Hey everyone, Like many of you, I’m looking into the newly released Gemma 4 12B to build a native speech-to-speech experience. Because of its unique encoder-free architecture, completely skipping the traditional STT bottleneck could be possible. Right now, my…

20
r/MachineLearning community 16d ago

Confused, where to start [D]

Hello community, I am a backend + big data dev. I want to learn about the llms that generate voices. I also read some articles but almost everyone of them starts from regression. There are so much resources available right now that I am now confused where to begin with. …

14
r/LocalLLaMA community 16d ago

ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning

https://reddit.com/link/1u4lk5c/video/kyhdw0uog07h1/player Links: Blog: https://zyphra.com/our-work/zonos2 Weights: https://huggingface.co/Zyphra/ZONOS2 Inference code: https://github.com/Zyphra/ZONOS2 Eval code: https://github.com/Zyphra/ZTTS1-Eval Model TTSDS Prosody Score ↑…

15
arXiv — NLP / Computation & Language research 18d ago

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

arXiv:2606.12902v1 Announce Type: new Abstract: Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while…

10
arXiv — NLP / Computation & Language research 18d ago

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

arXiv:2606.12911v1 Announce Type: new Abstract: Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST,…

8
arXiv — NLP / Computation & Language research 18d ago

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

arXiv:2606.13121v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive…

17
arXiv — NLP / Computation & Language research 18d ago

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

arXiv:2606.13464v1 Announce Type: new Abstract: Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires…

11
arXiv — NLP / Computation & Language research 18d ago

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech…

30
arXiv — NLP / Computation & Language research 18d ago

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

arXiv:2606.13630v1 Announce Type: new Abstract: The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for…

22
arXiv — NLP / Computation & Language research 18d ago

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

arXiv:2606.13544v1 Announce Type: cross Abstract: Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice…

35
r/LocalLLaMA community 18d ago

How I implemented ASR bias for voice transcription models [Open Source]

I've been spending the last couple of weeks building a Wispr Flow clone as an open source project. For context, it is a voice dictation app that lets you type faster, by speaking instead of actually typing. I spent the first week building the basic STT capabilities. One of the…

29
r/LocalLLaMA community 18d ago

Infinite Music Glitch on my Arduino with Magenta Realtime 2

I built a local voice AI realtime music setup where my ESP32 microcontroller talks to my MacBook over WebSockets. The microcontroller is just a tiny Arduino-based device with a mic and speaker, and the MacBook M4 Pro runs Magenta Realtime 2 locally and streams the audio back to…

38
Smol AI News news-outlet 19d ago

not much happened today

**Anthropic's Fable/Mythos export-control crisis** dominates AI news, highlighting the intersection of **national security** and frontier model access. Technical voices like **François Chollet** criticize opaque regulatory actions and advocate for **standardized benchmarks for…

6
arXiv — NLP / Computation & Language research 19d ago

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

arXiv:2606.11219v1 Announce Type: new Abstract: Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, Text-to-Audio Retrieval, Captioning, and Question-Answering accuracy remains…

32
arXiv — NLP / Computation & Language research 19d ago

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

arXiv:2606.11386v1 Announce Type: new Abstract: Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains…

28
arXiv — NLP / Computation & Language research 19d ago

Pretrained self-supervised speech models can recognize unseen consonants

arXiv:2606.11542v1 Announce Type: new Abstract: Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource…

17
arXiv — NLP / Computation & Language research 19d ago

Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

arXiv:2606.11639v1 Announce Type: new Abstract: The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies…

18
arXiv — NLP / Computation & Language research 19d ago

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

arXiv:2606.11681v1 Announce Type: new Abstract: We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the…

15
arXiv — NLP / Computation & Language research 19d ago

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

arXiv:2606.11197v1 Announce Type: cross Abstract: Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has…

19
arXiv — NLP / Computation & Language research 19d ago

Massive Open-Vocabulary Keyword Spotting

arXiv:2606.11279v1 Announce Type: cross Abstract: Automatic speech recognition systems have been shown to under-perform when it comes to transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with…

36
arXiv — NLP / Computation & Language research 19d ago

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

arXiv:2606.11429v1 Announce Type: cross Abstract: Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end…

29
r/LocalLLaMA community 19d ago

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

I kept wanting to talk to my local models instead of typing, but every voice setup wanted a GPU, shipped my audio to the cloud, or was macOS-only. So I built one that's none of those — and I benchmarked it, so these are real measured numbers, not vibes. One command installs the…

12
Hugging Face Daily Papers research 19d ago

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Abstract Sparse autoencoders trained on language model representations reveal interpretable features for speech synthesis that can be manipulated to control linguistic and prosodic attributes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Language models increasingly serve as the…

19
r/LocalLLaMA community 19d ago

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single…

31
arXiv — Machine Learning research 20d ago

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

arXiv:2606.09962v1 Announce Type: new Abstract: Continuous diffusion for categorical data is a framework belonging to the diffusion family and aiming at generating discrete data. The scientific interest to such models has been constantly increasing these days because researchers…

14
arXiv — NLP / Computation & Language research 20d ago

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

arXiv:2606.10029v1 Announce Type: cross Abstract: Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train…

22
arXiv — NLP / Computation & Language research 20d ago

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

arXiv:2606.10581v1 Announce Type: new Abstract: Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs)…

11
arXiv — NLP / Computation & Language research 20d ago

Speaker Group Encoding in Self-supervised Speech Recognition Models

arXiv:2606.10654v1 Announce Type: new Abstract: We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech…

10
arXiv — NLP / Computation & Language research 20d ago

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

arXiv:2606.10675v1 Announce Type: new Abstract: We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech…

34
arXiv — NLP / Computation & Language research 20d ago

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

arXiv:2606.11167v1 Announce Type: new Abstract: Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level…

23
arXiv — NLP / Computation & Language research 20d ago

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

arXiv:2606.06037v2 Announce Type: cross Abstract: Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability…

29
