Tag

Voice

365 articles archived under #voice · RSS

arXiv — NLP / Computation & Language research 1mo ago

FormalASR: End-to-End Spoken Chinese to Formal Text

arXiv:2605.19266v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented…

29
arXiv — NLP / Computation & Language research 1mo ago

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

arXiv:2605.19711v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error…

21
arXiv — NLP / Computation & Language research 1mo ago

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

arXiv:2605.19833v1 Announce Type: cross Abstract: Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic…

37
TechCrunch — AI news-outlet 1mo ago

You can now talk to your Gmail inbox, as seen at Google IO 2026

Google expands Gmail’s AI Inbox with conversational voice search, letting users ask Gemini to find buried email details.

15
r/LocalLLaMA community 1mo ago

Qwen3.6:27B VRAM 16GB 5080: MTP Quant, Speeds, and Configs

For those of you running Qwen3.6:27B on 16GB VRAM, what quantization did you settle on? For my primary purpose as a HA voice assistant, I've found my ideal target to be >50 tg and >800 pp. Qwen3.5:9B works really fast, but I'm experimenting with higher intelligence. Offloaded…

14
TechCrunch — AI news-outlet 1mo ago

Google adds voice-based prompting to Docs and Keep

Google is letting users create drafts, take notes, and search for email with voice with the new Workspace update

16
TechCrunch — AI news-outlet 1mo ago

Google’s AI now lets you talk to your Gmail inbox

Google expands Gmail’s AI Inbox with conversational voice search, letting users ask Gemini to find buried email details.

21
r/LocalLLaMA community 1mo ago

Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates

Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with voiceflow.com (the chatbot SaaS, name collision, sorry). Why this exists: I wanted local-only dictation and meeting transcription, because audio shouldn't have to leave the machine just to become…

13
r/LocalLLaMA community 1mo ago

Audio upscaling, cleanup, or improvement models?

I never see this type of model talked about. Are there many open models in the category? I do a lot of audio cleanup and end up using auphonic but would like to be using a local model. Edit: e.g like voice recovery, reverb removal, auto-EQ type stuff

5
arXiv — Machine Learning research 1mo ago

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

arXiv:2605.16545v1 Announce Type: new Abstract: After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems…

4
arXiv — NLP / Computation & Language research 1mo ago

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

arXiv:2605.16896v1 Announce Type: new Abstract: Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a…

20
arXiv — NLP / Computation & Language research 1mo ago

LLMs for automatic annotation of Mandarin narrative transcripts

arXiv:2605.17205v1 Announce Type: new Abstract: Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown…

32
arXiv — NLP / Computation & Language research 1mo ago

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

arXiv:2605.17443v1 Announce Type: new Abstract: We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our…

29
arXiv — NLP / Computation & Language research 1mo ago

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

arXiv:2605.17652v1 Announce Type: new Abstract: There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows…

10
arXiv — NLP / Computation & Language research 1mo ago

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

arXiv:2605.17710v1 Announce Type: new Abstract: Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present…

14
arXiv — NLP / Computation & Language research 1mo ago

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

arXiv:2605.17860v1 Announce Type: new Abstract: While modern Automatic Speech Recognition (ASR) systems achieve high accuracy on benchmark corpora, their performance often degrades when there is real-world variability. This work focuses on variability arising due to accented,…

36
arXiv — NLP / Computation & Language research 1mo ago

Bridging the Gap: Converting Read Text to Conversational Dialogue

arXiv:2605.18001v1 Announce Type: new Abstract: In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing…

20
r/MachineLearning community 1mo ago

Architecture advice: Real-time pipeline for YouTube Audio -> Whisper -> LLM -> SSE (Sub-10s latency) [D]

Hey everyone, I’m building a backend that analyzes long YouTube videos using an LLM. Currently, my flow is a slow waterfall: Download full audio -> Whisper -> LLM -> Return results . For a 30-minute video, the user waits forever. I want to pipeline this for real-time SSE…

5
r/LocalLLaMA community 1mo ago

21 GPU's benchmarked running a small TTS model (vram peak: 5GB)

I rented different GPUs on vast.ai for a few minutes each to benchmark a small TTS model, OmniVoice, with a peak VRAM usage of about 5 GB. I wanted to see how various mostly consumer GPUs would stack up against my own RTX 3090. This is by no means an extensive or scientific…

16
MIT Technology Review — AI news-outlet 1mo ago

Inside Anduril and Meta’s quest to make smart glasses for warfare

The defense-tech company Anduril has shared new details about the augmented-reality headset for the military it’s prototyping with Meta, including a vision for ordering drone strikes via eye-tracking and voice commands. Quay Barnett, who leads the efforts as a vice president at…

6
r/LocalLLaMA community 1mo ago

Benchmarked Kokoro 82M vs Supertonic 3 TTS on CPU

Wanted a real head to head on the two TTS models that actually run well on CPU. Couldn't find one with proper numbers, so I ran one. Posting because the result was not what I expected going in. Quick context for anyone who hasn't seen Supertonic 3 yet: it's a flow-matching TTS…

35
Hacker News — AI on Front Page community 1mo ago

Eric Schmidt speech about AI booed during graduation

Article URL: https://www.nbcnews.com/tech/tech-news/former-google-ceo-booed-graduation-speech-ai-rcna345585 Comments URL: https://news.ycombinator.com/item?id=48177785 Points: 242 # Comments: 218

36
arXiv — NLP / Computation & Language research 1mo ago

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

arXiv:2605.15886v1 Announce Type: new Abstract: This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics…

9
arXiv — NLP / Computation & Language research 1mo ago

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

arXiv:2605.16026v1 Announce Type: new Abstract: Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language…

29
arXiv — NLP / Computation & Language research 1mo ago

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

arXiv:2605.16077v1 Announce Type: new Abstract: Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to…

38
TechCrunch — AI news-outlet 1mo ago

If you’re giving a commencement speech in 2026, maybe don’t mention AI

It's tough to get graduating students excited about a future shaped by artificial intelligence.

20
r/LocalLLaMA community 1mo ago

GitHub - richardr1126/openreader: An open-source read-along document reader server with high-quality TTS options, synchronized highlighting, and audiobook export for EPUB, PDF, DOCX, TXT, and MD.

Sharing my latest release of OpenReader v3.0.0, an open-source text-to-speech document reader and audiobook exporter. It has been live for over a year now, and slowly has gained 300+ GitHub stars. What is OpenReader? A Next.js web app for reading and listening to EPUB, PDF, TXT,…

9
The Information — AI news-outlet 1mo ago

OpenAI Buys AI Voice Startup Weights

OpenAI bought Weights.GG, a small startup that made an AI voice-cloning tool called Replay, in January, according to a person familiar with the acquisition. A half dozen employees joined OpenAI, which bought the startup’s intellectual property but does not plan to integrate the…

36
r/LocalLLaMA community 1mo ago

macOS support in Lemonade has graduated out of beta!

All major Lemonade capabilities, including OmniRouter, coding, image gen, speech gen, and transcription are all available on Lemonade for macOS thanks to the hard work of u/GeramyL . If you're on macOS and just looking into Lemonade for the first time, we're a local AI solution…

18
r/LocalLLaMA community 1mo ago

Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.

Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention. 12K context, native system role, sampler defaults from the model card. Cached TTFT around 200ms, sustained 14-15 tok/s. SenseVoiceSmall for STT, Piper for TTS with…

21
r/LocalLLaMA community 1mo ago

GitHub - pwilkin/openmoss: OpenMOSS pure C++ pipeline based on GGML

I'm uploading a full GGML-based pipeline for OpenMOSS ( https://huggingface.co/OpenMOSS-Team/MOSS-TTS ) that I've vibe-coded for myself in case someone else finds it useful. TTS models are notoriously annoying to set up due to the entire Python ecosystem, so I decided I'd make…

29
arXiv — NLP / Computation & Language research 1mo ago

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

arXiv:2605.14427v1 Announce Type: new Abstract: In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive…

15
arXiv — NLP / Computation & Language research 1mo ago

Streaming Speech-to-Text Translation with a SpeechLLM

arXiv:2605.14766v1 Announce Type: new Abstract: Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the…

7
arXiv — NLP / Computation & Language research 1mo ago

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

arXiv:2605.15104v1 Announce Type: new Abstract: Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling…

19
arXiv — NLP / Computation & Language research 1mo ago

A Benchmark for Early-stage Parkinson's Disease Detection from Speech

arXiv:2605.14066v1 Announce Type: cross Abstract: Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and…

21
Hugging Face Daily Papers research 1mo ago

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Abstract: EVA-Bench presents a comprehensive evaluation framework for voice agents that simulates realistic conversations and measures performance across multiple voice-specific failure modes using novel accuracy and experience metrics. AI-generated summary: Voice agents,…

21
Hacker News — AI on Front Page community 1mo ago

MIT: 20% drop in incoming graduate students

Article URL: https://president.mit.edu/writing-speeches/video-transcript-message-president-kornbluth-about-funding-and-talent-pipeline Comments URL: https://news.ycombinator.com/item?id=48136262 Points: 211 # Comments: 195

21
r/LocalLLaMA community 1mo ago

[MIT] RLCR: Teaching AI models to say "I'm not sure"

Confidence is persuasive. In AI systems, it is often misleading. Today's most capable reasoning models share a trait with the loudest voice in the room: They deliver every answer with the same unshakable certainty, whether they're right or guessing. Researchers at MIT's Computer…

23
r/LocalLLaMA community 1mo ago

Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future.

I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly". Speech recognition in Hedy has always run on-device (whisper.cpp and now also parakeet). What just shipped is that the rest of the AI…

22
r/LocalLLaMA community 1mo ago

Scenema Audio: Zero-shot expressive voice cloning and speech generation

We've been building Scenema Audio as part of our video production platform at scenema.ai, and we're releasing the model weights and inference code. The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage,…

17
r/LocalLLaMA community 1mo ago

Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD…

13
Hugging Face Daily Papers research 1mo ago

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Abstract: Research identifies studio-bias in multilingual ASR fine-tuning and proposes R-MFT method to improve spontaneous speech performance while maintaining efficiency. AI-generated summary: Fine-tuning multilingual ASR models like Whisper for low-resource languages often…

20
r/MachineLearning community 1mo ago

Scenema Audio: Zero-shot expressive voice cloning and speech generation [N]

We've been building Scenema Audio as part of our video production platform at scenema.ai, and we're releasing the model weights and inference code. The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage,…

37
r/LocalLLaMA community 1mo ago

DramaBox - Most Expressive Voice model ever based on LTX 2.3

The Most Expressive Voice Model. Github: https://github.com/resemble-ai/DramaBox HF Model: https://huggingface.co/ResembleAI/Dramabox HF Space: https://huggingface.co/spaces/ResembleAI/Dramabox

22
arXiv — NLP / Computation & Language research 1mo ago

Predicting Psychological Well-Being from Spontaneous Speech using LLMs

arXiv:2605.11303v1 Announce Type: new Abstract: We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD…

7
arXiv — NLP / Computation & Language research 1mo ago

Mechanistic Interpretability of ASR models using Sparse Autoencoders

arXiv:2605.12225v1 Announce Type: new Abstract: Understanding the internal machinations of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in various domains that affect the public at large, such as industry, academia, finance,…

24
arXiv — NLP / Computation & Language research 1mo ago

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

arXiv:2605.12242v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left…

5
r/LocalLLaMA community 1mo ago

I built Derpy Turtle: The Kokoro Trainer, a GUI for training better Kokoro voices with RVC

I’ve been working on a tool called Derpy Turtle: The Kokoro Trainer. It started as a random-walk experiment for Kokoro voices, but it has grown into its own thing: a Windows GUI for creating better local voice outputs by combining Kokoro voice search with RVC voice conversion.…

9
Latent.Space news-outlet 1mo ago

[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD

well done, Team Thinky.

26
Latent.Space news-outlet 1mo ago

[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

OpenAI continues deploying GPT-5 everywhere

18
