Tag

Voice

365 articles archived under #voice · RSS

arXiv — NLP / Computation & Language research 20d ago

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

arXiv:2606.10439v1 Announce Type: cross Abstract: The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work…

21
arXiv — NLP / Computation & Language research 20d ago

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

arXiv:2606.10475v1 Announce Type: cross Abstract: Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the…

18
r/LocalLLaMA community 20d ago

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

Built a decision-reasoning engine (Orlog) and wanted to fine-tune a local model for it instead of paying per-call forever. The method (DV-DPO): run a 3-voice council on each question and produce a synthesis; cross-examine: losing voices challenge the synthesis; if synthesis gets…

35
Vercel — AI dev-tools 20d ago

Threshold billing is now enabled for Pro teams

Threshold billing now sends Pro teams a partial invoice mid-cycle once on-demand usage reaches a threshold, instead of holding all charges until the end of the billing period. Partial invoices and the end-of-cycle invoice add up to your total usage, so the same usage is never…

15
r/MachineLearning community 20d ago

iOS 27 Siri is using WaveRNN and FastSpeech2 [D]

Found in the iOS Simulator's files. Both of them are in espresso format. There's also another compiled CoreML model for concert ranking which, based on its contents, looks to be a simple logistic regression. See…

38
TechCrunch — AI news-outlet 20d ago

Hey Siri, here’s what I actually want from AI

I'm desperate for a personal AI assistant, but do I really want to become the kind of person who can't function without the friendly robot voice in my phone?

4
Hugging Face official-blog 20d ago

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Enterprise article, published June 9, 2026. Authors: Shama Gupta, Lindsay Brin, Fanny Riols (ServiceNow-AI)…

11
Ars Technica — AI news-outlet 20d ago

Google announces Gemini 3.5 Live Translate for instant voice-to-voice translation

Voice translations preserve speaker's tone, pacing, pitch—with SynthID watermarks for security.

16
llama.cpp releases dev-tools 20d ago

b9585

graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (#24357). llama-graph: apply embedding scale when deepstack is not used; nits: remove non-existent hunyuan-vl from the tests; apply suggestion from @gabe-l-hart. Co-authored-by: Xuan…

25
r/MachineLearning community 20d ago

What will be the next breakthrough in ASR? [D]

Hey All, I am currently working on ASR models, and I have gathered some recent literature. From my literature search, it seems like the ASR models are getting more and more powerful due to two main things. Because pseudo-labelled data is growing, supervised models are rising…

35
r/LocalLLaMA community 20d ago

Text-to-Speech (TTS) Benchmark Revamped with Objective Standards and Blind Voting (46 models and counting)

Thank you to everyone who contributed to my previous post, providing feedback and various models to add, and questioning the rating system. You can now participate in live blind voting to create a proper Elo rating for all the models that are added. Each new model that we add will…

23
The Information — AI news-outlet 20d ago

Broadcom to Help Finance Anthropic, OpenAI Chip Deals With Apollo, Blackstone

Broadcom said Tuesday that it is launching a new fund—backed by Apollo and Blackstone—to help finance more than 20 gigawatts of AI data centers through 2028 using chips designed by Broadcom, including projects tied to Anthropic and OpenAI. Apollo will lead an initial $35 billion…

19
Google DeepMind official-blog 20d ago

Fluid, natural voice translation with Gemini 3.5 Live Translate

Gemini 3.5 Live Translate brings near real-time, natural speech translation to Google AI Studio, Google Translate and Google Meet.

32
NVIDIA Developer Blog official-blog 20d ago

Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech

Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine,...

9
r/LocalLLaMA community 20d ago

PSA: Throttle GPU power limits, with minor performance deficits

I just feel i need to post this here again so more people see: Test around with throttling the power limits of your GPUs, you will often find that you can save tons of power with only minor performance deficits. On my dual Radeon VII setup, i went from 250 to 100 watts per card,…

11
Hugging Face Daily Papers research 20d ago

Liberating LLM Capabilities in Full-Duplex Speech Models

Abstract: A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating superior performance in full-duplex conversational tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Speech-based large language…

21
Hugging Face Daily Papers research 20d ago

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Abstract Research demonstrates that hallucinations in Whisper ASR can be detected and reduced using internal representations from audio encoder activations and Sparse AutoEncoder latents, achieving significant hallucination rate reduction with minimal speech transcription…

20
OpenAI official-blog 20d ago

What Codex unlocks for Notion

How Notion uses Codex to one-shot specs, build AI Voice Input for the web, and multiply engineering power across small teams.

26
arXiv — Machine Learning research 21d ago

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

arXiv:2606.07610v1 Announce Type: new Abstract: State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful…

6
r/LocalLLaMA community 21d ago

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else. The…

14
The Information — AI news-outlet 21d ago

Apple Tries for Another Siri Reboot

Apple launched a much anticipated new version of its Siri voice assistant at the start of its annual developer conference on Monday, which users will be able to access through a new Siri app. The refreshed voice assistant, now called Siri AI, which uses Google’s Gemini models,…

31
Ars Technica — AI news-outlet 21d ago

Say hi to "Siri AI"—Apple announces new, more "conversational" voice assistant

New features coming this fall alongside two-tiered, Google-powered AI model overhaul.

6
TechCrunch — AI news-outlet 21d ago

Apple’s long-awaited AI Siri overhaul is finally here

The idea behind the new "Siri AI" is to turn the assistant from a voice controlled assistant into an AI companion that can do a lot more.

30
Hacker News — AI on Front Page community 21d ago

Massachusetts bans sale of precise location data in new privacy rights bill

Article URL: https://techcrunch.com/2026/06/08/massachusetts-votes-to-pass-new-privacy-rights-bill-that-bans-sale-of-precise-location-data/ Comments URL: https://news.ycombinator.com/item?id=48448012 Points: 214 # Comments: 34

29
Hugging Face Daily Papers research 21d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
arXiv — Machine Learning research 22d ago

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

arXiv:2606.06833v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal…

34
arXiv — NLP / Computation & Language research 22d ago

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

arXiv:2606.06985v1 Announce Type: new Abstract: Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive…

28
arXiv — NLP / Computation & Language research 22d ago

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

arXiv:2606.07240v1 Announce Type: new Abstract: Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026…

21
arXiv — NLP / Computation & Language research 22d ago

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

arXiv:2606.06740v1 Announce Type: cross Abstract: Discrete speech units obtained via k-means clustering of self supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech…

22
arXiv — NLP / Computation & Language research 22d ago

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

arXiv:2606.06743v1 Announce Type: cross Abstract: The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main…

21
arXiv — NLP / Computation & Language research 22d ago

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

arXiv:2606.07309v1 Announce Type: cross Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question…

14
arXiv — NLP / Computation & Language research 22d ago

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

arXiv:2606.07435v1 Announce Type: cross Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI…

28
r/LocalLLaMA community 22d ago

Best Local TTS solution

So I have been testing a bunch of different solutions for local TTS - nothing so far comes close to elevenlabs for dynamic ability, voices, cloning. I’d like to have a phone-compatible setup. So far the best I can find for edge devices is moss-nano and kokoro. Free/cloud so far…

25
Hugging Face Daily Papers research 22d ago

dots.tts Technical Report

Abstract A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques. Generated by…

32
r/LocalLLaMA community 22d ago

Dockerized Nemotron 3.5 ASR — Switched from Parakeet, better multilingual support + streaming (4.5x realtime speed on cpu)

I was originally using Parakeet for my speech recognition pipeline but decided to give Nemotron 3.5 a shot. After testing it on some multilingual audio clips, it's been working great so far. What sold me: - Better language support (40+ locales from one model) - Native streaming…

17
r/LocalLLaMA community 23d ago

Serving TTS/cloning models on llama.cpp?

Are there any quality voice cloning and speech generation models that already have support in llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather than making a separate container or conda for each model I…

17
r/LocalLLaMA community 24d ago

dots.tts 2B🎙️ SOTA TTS from RedNote

🔗 Blog: https://rednote-hilab.github.io/dots.tts-demo/ 🔗 GitHub: https://github.com/rednote-hilab/dots.tts 🔗 Technical Report: https://arxiv.org/abs/2608.16894 dots.tts 🎙️ New open-source TTS from RedNote (Xiaohongshu) ✨ 2B parameters (Apache 2.0) ✨ Fully continuous…

16
Hugging Face Daily Papers research 25d ago

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Abstract: Code-switching automatic speech recognition models show limited generalization across unseen language pairs despite attempts at model merging and domain generalization techniques. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Automatic Speech Recognition (ASR) has become…

35
arXiv — NLP / Computation & Language research 25d ago

Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

arXiv:2606.05179v1 Announce Type: new Abstract: Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However, streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which…

21
arXiv — NLP / Computation & Language research 25d ago

Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach

arXiv:2606.05545v1 Announce Type: new Abstract: The development of multilingual Alzheimer's Disease Dementia (AD) detection models presents significant challenges due to the resource-intensive and time-consuming nature of language-specific model training. We propose a novel…

35
arXiv — NLP / Computation & Language research 25d ago

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

arXiv:2606.05561v1 Announce Type: new Abstract: Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve…

34
arXiv — NLP / Computation & Language research 25d ago

Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

arXiv:2606.05569v1 Announce Type: new Abstract: Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs…

7
arXiv — NLP / Computation & Language research 25d ago

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

arXiv:2606.05846v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across…

8
arXiv — NLP / Computation & Language research 25d ago

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both.…

21
arXiv — NLP / Computation & Language research 25d ago

Automatic Labelling of Speech Translation Errors

arXiv:2606.06047v1 Announce Type: new Abstract: Errors in speech translations reduce trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech…

21
arXiv — NLP / Computation & Language research 25d ago

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

arXiv:2606.06065v1 Announce Type: new Abstract: Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs.…

9
arXiv — NLP / Computation & Language research 25d ago

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an…

13
arXiv — NLP / Computation & Language research 25d ago

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

arXiv:2606.06211v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear…

18
arXiv — NLP / Computation & Language research 25d ago

From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation

arXiv:2606.06266v1 Announce Type: new Abstract: Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale.…

22
Hugging Face Daily Papers research 25d ago

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Abstract: A bilingual multi-attribute benchmark for instruction-guided speech editing is introduced to systematically evaluate speech modification capabilities across atomic and compositional tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Instruction-guided speech editing…

16
