Tag

Voice

365 articles archived under #voice · RSS

arXiv — NLP / Computation & Language research 1mo ago

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv:2605.28833v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for…

12
arXiv — NLP / Computation & Language research 1mo ago

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

arXiv:2605.29188v1 Announce Type: new Abstract: Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement…

5
arXiv — NLP / Computation & Language research 1mo ago

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

arXiv:2605.27376v1 Announce Type: new Abstract: While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use…

34
arXiv — NLP / Computation & Language research 1mo ago

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

arXiv:2605.27383v1 Announce Type: new Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by…

24
arXiv — NLP / Computation & Language research 1mo ago

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

arXiv:2605.27808v1 Announce Type: new Abstract: Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech…

14
arXiv — NLP / Computation & Language research 1mo ago

Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

arXiv:2605.27874v1 Announce Type: new Abstract: Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the…

13
arXiv — NLP / Computation & Language research 1mo ago

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

arXiv:2605.27984v1 Announce Type: new Abstract: Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment…

10
arXiv — NLP / Computation & Language research 1mo ago

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

arXiv:2605.28211v1 Announce Type: new Abstract: SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify…

6
arXiv — NLP / Computation & Language research 1mo ago

Why We Need Speech to Evaluate Speech Translation

arXiv:2605.28227v1 Announce Type: new Abstract: Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and…

35
arXiv — NLP / Computation & Language research 1mo ago

Building Community-Centred NLP Resources for Puno Quechua

arXiv:2605.28253v1 Announce Type: new Abstract: The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for…

30
The Information — AI news-outlet 1mo ago

The Hot New Way to Communicate with AI? Whispering

If you wander into the Manhattan office of AI startup Basis on a workday, you’ll see most of its 100 or so staffers whispering quietly into gooseneck microphones at their desks. They aren’t taking phone calls or talking with other humans at all. They’re speaking softly to their…

23
r/MachineLearning community 1mo ago

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

If you've ever tried to pick an STT vendor for a phone-based voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, CommonVoice,…

31
arXiv — NLP / Computation & Language research 1mo ago

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

arXiv:2605.26978v1 Announce Type: new Abstract: Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target…

13
arXiv — NLP / Computation & Language research 1mo ago

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

arXiv:2605.27025v1 Announce Type: new Abstract: Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments…

27
arXiv — NLP / Computation & Language research 1mo ago

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

arXiv:2605.27030v1 Announce Type: new Abstract: Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated…

10
arXiv — NLP / Computation & Language research 1mo ago

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

arXiv:2605.27062v1 Announce Type: new Abstract: State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented…

31
arXiv — NLP / Computation & Language research 1mo ago

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

arXiv:2605.27189v1 Announce Type: new Abstract: This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate…

7
Hacker News — AI on Front Page community 1mo ago

Uber, Lyft drivers in Massachusetts form first US ride-share union

Article URL: https://www.reuters.com/business/world-at-work/uber-lyft-drivers-massachusetts-form-first-us-ride-share-union-2026-05-26/ Comments URL: https://news.ycombinator.com/item?id=48281509 Points: 220 # Comments: 118

37
r/LocalLLaMA community 1mo ago

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

MOSS-TTS-v1.5 MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0 . It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For…

10
Hugging Face Daily Papers research 1mo ago

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

Abstract ASASR addresses spectral misalignment in image super-resolution by leveraging Riemannian geometry and adversarial training to improve structural fidelity and reduce artifacts. AI-generated summary Generative priors in Image Super-Resolution (SR) often compromise…

10
r/LocalLLaMA community 1mo ago

Self-hosted STT better than Whisper Large V3 Turbo that matches AssemblyAI quality?

I’m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isn’t where I need it. I like AssemblyAI’s quality and want something self-hosted that: - Is clearly better than Whisper Large V3 Turbo - Can match or get close to AssemblyAI’s transcription quality -…

11
r/LocalLLaMA community 1mo ago

I finally put my NPU (Intel Arrow Lake) to use doing ASR for my smart home

I wrote about what I found in a deep dive elsewhere (which I will not mention because Reddit doesn't like cross-linking), but I wanted to share it here since this is where I learn the most about AI stuff, and I've seen questions about NPUs here before that are often dismissed as…

9
arXiv — Machine Learning research 1mo ago

Hardware-Aware Federated Learning for Speech Emotion Recognition

arXiv:2605.24712v1 Announce Type: new Abstract: Federated learning (FL) enables privacy-preserving collaborative training across distributed edge devices, but real deployments involve heterogeneous clients with different processing power, memory capacity, and communication…

16
arXiv — NLP / Computation & Language research 1mo ago

Raon-Speech Technical Report

arXiv:2605.23912v1 Announce Type: new Abstract: We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural…

11
arXiv — NLP / Computation & Language research 1mo ago

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

arXiv:2605.23975v1 Announce Type: new Abstract: Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission,…

14
arXiv — NLP / Computation & Language research 1mo ago

End-to-End Intracortical Speech Decoding from Neural Activity

arXiv:2605.24313v1 Announce Type: new Abstract: Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate…

38
arXiv — NLP / Computation & Language research 1mo ago

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

arXiv:2605.24451v1 Announce Type: new Abstract: Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for…

24
arXiv — NLP / Computation & Language research 1mo ago

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

arXiv:2605.25404v1 Announce Type: new Abstract: Cascaded Automatic Speech Recognition -- Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However,…

28
r/MachineLearning community 1mo ago

Best architecture for seamless Bilingual TTS? (Azure / English + Korean) [D]

Hi guys, I’m building a language learning app (React Native/Expo frontend, Python backend) and I’ve hit a frustrating wall with Text-to-Speech. I need the app to read sentences that mix English instructions and Korean examples (e.g., "To say hello, we use the phrase 안녕하세요.").…

20
arXiv — Machine Learning research 1mo ago

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

arXiv:2605.23235v1 Announce Type: new Abstract: Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language…

29
arXiv — NLP / Computation & Language research 1mo ago

A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development

arXiv:2605.22828v1 Announce Type: new Abstract: This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo…

12
arXiv — NLP / Computation & Language research 1mo ago

AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

arXiv:2605.23325v1 Announce Type: new Abstract: Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied,…

36
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Gaslighting Attacks Against Speech Large Language Models

arXiv:2509.19858v2 Announce Type: replace Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied…

18
Simon Willison community 1mo ago

Quoting Armin Ronacher

The most frustrating failure mode right now is that people submit issues that are not in their own voice. They contain an observed problem somewhere, but it has been thrown into a clanker and the clanker reworded it and made a huge mess of it. Typically, it was prompted so badly…

18
r/LocalLLaMA community 1mo ago

TTS Benchmark Comparison (all known TTS up until May 2026)

I was tired of not having a proper TTS related benchmark that I can use and test for personal projects, so I had to make one. Hopefully this helps those looking for running local TTS tools. Has Windows and Mac results already. Linux will be tested shortly (have a 5900XT and 3090…

23
TechCrunch — AI news-outlet 1mo ago

AI is being used to resurrect the voices of dead pilots

People used AI on a spectrogram image of cockpit recordings to reconstruct them, forcing the NTSB to temporarily block access to its docket system.

11
r/LocalLLaMA community 1mo ago

I fine-tuned Cohere Transcribe to support diarization and timestamps

Hi, I'll keep it short: Cohere-transcribe is currently the best open-source speech-to-text model (and possibly even better than some proprietary models). BUT it doesn't support diarization (speaker identification) and timestamps, even though there are tokens for it in the…

36
Ars Technica — AI news-outlet 1mo ago

US scrambles to stop Internet users re-creating dead pilots’ voices

Workaround flouts law that bans NTSB disclosures of cockpit audio recordings.

13
Hacker News — AI on Front Page community 1mo ago

Steve Wozniak cheered after telling students they have AI – actual intelligence

Article URL: https://www.businessinsider.com/steve-wozniak-apple-ai-graduation-speech-2026-5 Comments URL: https://news.ycombinator.com/item?id=48233563 Points: 243 # Comments: 197

7
arXiv — NLP / Computation & Language research 1mo ago

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

arXiv:2605.22170v1 Announce Type: new Abstract: In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. The question then emerges about how model-internal mechanisms are similar and different when operating in…

9
arXiv — NLP / Computation & Language research 1mo ago

Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

arXiv:2605.22435v1 Announce Type: new Abstract: Hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization. Given their scale, using Large Language Models (LLMs) to assist expert counterspeech (CS) writing has gained interest, yet prior work…

31
arXiv — NLP / Computation & Language research 1mo ago

Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

arXiv:2605.22650v1 Announce Type: new Abstract: As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and…

5
r/LocalLLaMA community 1mo ago

Best solution to generate reports locally with graphs, charts? Beginner question.

So on a local LM tool like Ollama or LM Studio, you can run questions and prompts. But it’s a text response, and I am unable to have it generate PDFs, report files, or graphs, such as a pie chart of my invoices or a report on statistics. When I run Kimi or Claude…

11
Hugging Face Daily Papers research 1mo ago

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Abstract Mega-ASR framework improves robustness in real-world speech recognition through compound-data construction and progressive acoustic-to-semantic optimization techniques. AI-generated summary Despite rapid advances in automatic speech recognition (ASR) and large…

29
arXiv — NLP / Computation & Language research 1mo ago

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

arXiv:2605.20356v1 Announce Type: new Abstract: Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how…

4
arXiv — NLP / Computation & Language research 1mo ago

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

arXiv:2605.20712v1 Announce Type: new Abstract: Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error…

16
arXiv — NLP / Computation & Language research 1mo ago

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

arXiv:2605.20920v1 Announce Type: new Abstract: Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment…

27
arXiv — NLP / Computation & Language research 1mo ago

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

arXiv:2605.20946v1 Announce Type: new Abstract: The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during…

18
Hacker News — AI on Front Page community 1mo ago

College students drown out AI-praising commencement speeches with boos

Article URL: https://www.tomshardware.com/tech-industry/artificial-intelligence/college-students-drown-out-ai-praising-commencement-speeches-with-boos-deal-with-it-one-speaker-fires-back-as-students-heckle-positive-pitches-for-ais-role Comments URL:…

21
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

arXiv:2605.19069v1 Announce Type: new Abstract: Code-switching -- the natural alternation between two languages within a single utterance -- represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR…

31
