Tag

Voice

365 articles archived under #voice · RSS

r/LocalLLaMA community 25d ago

Higgs Audio v3 TTS 4B. Built for voice chat. Support 100 languages and inline control.

Submitted by /u/FerretLegitimate6929

31
Hugging Face official-blog 25d ago

How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

Published June 4, 2026. By Maryam Motamedi (NVIDIA), Adi Margolin (NVIDIA), Francesco (fciannella, NVIDIA), Myungjong Kim (NVIDIA), Enas Albasiri…

4
arXiv — Machine Learning research 26d ago

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

arXiv:2606.04678v1 Announce Type: new Abstract: End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse…

4
arXiv — NLP / Computation & Language research 26d ago

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

arXiv:2606.04474v1 Announce Type: new Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T)…

37
arXiv — NLP / Computation & Language research 26d ago

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing…

18
arXiv — NLP / Computation & Language research 26d ago

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

arXiv:2606.04730v1 Announce Type: new Abstract: With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is…

7
arXiv — NLP / Computation & Language research 26d ago

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

arXiv:2606.04418v1 Announce Type: cross Abstract: Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency,…

6
Hugging Face Daily Papers research 26d ago

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

Abstract: OpenSTBench presents a unified evaluation framework for speech translation systems that assesses multiple dimensions including translation quality, speech quality, and temporal consistency across different modalities and settings.

15
TechCrunch — AI news-outlet 26d ago

These two founders left Goldman and Meta to build voice AI for markets everyone else overlooked

The startup's own stack for Africa and Middle East is now handling more than 17,000 calls per day.

21
arXiv — Machine Learning research 27d ago

CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

arXiv:2606.02998v1 Announce Type: new Abstract: Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical tool needs to tell apart several respiratory conditions from one cough recording on a…

4
arXiv — NLP / Computation & Language research 27d ago

Benchmarking Speech-to-Speech Translation Models

arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and…

5
arXiv — NLP / Computation & Language research 27d ago

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

arXiv:2606.03504v1 Announce Type: new Abstract: We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated…

4
arXiv — NLP / Computation & Language research 27d ago

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

arXiv:2606.03948v1 Announce Type: new Abstract: We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to IWSLT 2026 Simultaneous Speech Translation Shared task…

16
arXiv — NLP / Computation & Language research 27d ago

Efficient ASR Training with Conversations that Never Happened

arXiv:2606.03957v1 Announce Type: new Abstract: Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with…

21
arXiv — NLP / Computation & Language research 27d ago

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

arXiv:2606.03967v1 Announce Type: new Abstract: We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated…

16
Hugging Face Daily Papers research 27d ago

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract: Deep learning approach for co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken text and gesture representations, enhancing both retrieval accuracy and semantic relevance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Learning…

17
TechCrunch — AI news-outlet 27d ago

Martin Scorsese becomes the latest — and most unlikely — Hollywood voice for AI

The caveat is that one of the world's most famous living directors is using the tech solely for storyboarding.

38
The Information — AI news-outlet 27d ago

5 Ways Companies Keep AI Bills in Check

Snowflake CEO Sridhar Ramaswamy on Monday became the latest executive to voice concerns over rising AI costs. “Are we worried about how much we are spending on AI inference across our internal teams? Absolutely,” he told my colleague Laura during Snowflake’s annual conference…

23
Smol AI News news-outlet 28d ago

Microsoft Build: MAI-Thinking-1 and MAI Family models, Surface RTX Spark Dev Box, and OpenClaw in Windows

**Microsoft** introduced **MAI-Thinking-1**, a **35B parameter MoE model** with **256K context**, achieving **97% on AIME 2025** and outperforming **Sonnet 4.6** in human preference tests. The broader **7-model MAI family** spans reasoning, code, image, speech, and voice, with…

37
r/LocalLLaMA community 28d ago

Moss tts 1.5 8b Examples. It is the currently best voice cloning model for English as of June 2026

Moss tts 1.5 8b is better than fish audio s2 pro and qwen 3 tts voice clone tts. You can easily get better quality if you set the duration of the output voice you want and adjust the temperature and other settings. This was just run on default settings. It can be improved…

20
arXiv — NLP / Computation & Language research 28d ago

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

arXiv:2606.00460v1 Announce Type: new Abstract: Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering…

23
arXiv — NLP / Computation & Language research 28d ago

LaSR: Context-Aware Speech Recognition via Latent Reasoning

arXiv:2606.00507v1 Announce Type: new Abstract: Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, struggling to perform speech recognition that…

4
arXiv — NLP / Computation & Language research 28d ago

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

arXiv:2606.01016v1 Announce Type: new Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a…

19
arXiv — NLP / Computation & Language research 28d ago

Child-directed speech facilitates production, not comprehension, in BabyLMs

arXiv:2606.01045v1 Announce Type: new Abstract: Recent studies suggest that child-directed speech is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension and not production, which is central to usage-based theories of…

26
arXiv — NLP / Computation & Language research 28d ago

Challenger at MultiPRIDE: Is It Hate Speech or Reclaimed?

arXiv:2606.01298v1 Announce Type: new Abstract: The spread of hate speech has become increasingly harmful in modern digital environments, particularly on social networking platforms. While recent advances have shown promising results in automatic hate speech detection, a key…

34
r/MachineLearning community 28d ago

Full duplex vs half duplex - the spectrum of AI voice models [D]

It seems that there are two ways to build voice AI: Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today. Full-duplex: two channels, both sides can talk at…

32
r/MachineLearning community 28d ago

Real-time multilingual ASR using rolling buffers and monolingual models [P]

I built a routing-based approach to lightweight real-time multilingual ASR as part of my research at Gladia. The core problem was how multilingual models that accurately handle mid-conversation language switches are often too big for most local hardware and have poor accuracy.…

36
Vercel — AI dev-tools 28d ago

Chat SDK adds AgentPhone support

Chat SDK now supports AgentPhone with the new vendor-official adapter. Give your bot its own phone number so it can handle voice calls and text messages using the same handlers you already write. When a call ends, the transcript is delivered as a message, allowing your bot to…

14
arXiv — NLP / Computation & Language research 29d ago

Your Multimodal Speech Model Says I Have a Face for Radio

arXiv:2605.30472v1 Announce Type: new Abstract: As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to…

35
arXiv — NLP / Computation & Language research 29d ago

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

arXiv:2605.30608v1 Announce Type: new Abstract: Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is…

4
arXiv — NLP / Computation & Language research 29d ago

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

arXiv:2605.31432v1 Announce Type: new Abstract: Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based…

37
arXiv — NLP / Computation & Language research 29d ago

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

arXiv:2605.31469v1 Announce Type: new Abstract: Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint…

17
Hugging Face Daily Papers research 29d ago

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Abstract: Swanbench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations. AI-generated summary: Recent advances in speech generation have enabled…

5
Hugging Face Daily Papers research 29d ago

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Abstract: A zero-shot text-to-speech system called SwanVoice is presented that addresses expressive long-form multi-speaker dialogue synthesis by combining VAE, flow-matching DiT, and diffusion post-training techniques. AI-generated summary: Zero-shot text-to-speech (TTS) has…

6
r/MachineLearning community 29d ago

Arabic ASR model struggling to converge during training [D]

i'm trying to train an ASR model using the LibriSpeech recipe from SpeechBrain (without the language model) on a 100-hour dataset of dialectal Arabic speech. the model architecture uses a Conformer-small encoder and a Transformer decoder, with a total of around 13M parameters.…

23
r/LocalLLaMA community 29d ago

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

I ported NVIDIA's Parakeet speech-to-text models to pure C++/ggml (the engine behind llama.cpp and whisper.cpp). It runs the FastConformer TDT / CTC / RNNT / hybrid models with no Python and no PyTorch, on CPU and GPU (CUDA, HIP, Vulkan, Metal). The goal was to match NeMo…

30
r/LocalLLaMA community 29d ago

13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics

I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities. coder3101's variant achieves 96% ASR with capability…

17
TechCrunch — AI news-outlet 1mo ago

SoftBank says it will invest up to €75 billion to build French data centers

The goal, the firm said, is to develop and operate up to 5 gigawatts of additional data center capacity.

30
The Information — AI news-outlet 1mo ago

Softbank to Invest Up To 75 Billion Euros on AI Data Centers in France

SoftBank Group announced a commitment to develop and operate five gigawatts of AI data center capacity in France, with an investment of up to 75 billion euros, or about $87.5 billion. The commitment is SoftBank’s largest AI infrastructure investment to date in Europe, the…

13
r/LocalLLaMA community 1mo ago

Whisper.cpp is underwhelming

Hi, I'm running whisper.cpp with the best model I could find (ggml-large-v3) but after about 20 min of transcription it hallucinates a sentence that it will repeat endlessly until the end. Is there something I'm missing or should I cut my files to about 20 minutes length?  …

17
r/LocalLLaMA community 1mo ago

STT -> LLM -> TTS pipeline

Hey guys, I’m trying to learn about how to better create a STT LLM TTS pipeline. My current setup is running a 3090 on Ubuntu. I use llama.cpp to run Qwen 3.6 27B Q4 with pi-agent for tool calling, and I just run everything in the terminal, I haven’t really bothered with chat…

25
r/LocalLLaMA community 1mo ago

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar

Under $1000 for 32gb vram from 2023, and ~300 watts draw... and this thing is outperforming the latest pick-your-vendor $5k mini pcs from 2026. So.. next question is can I make it squeeze 150 t/s with the same q4xl on cuda 13.3 this weekend. Anyone try it yet?

13
r/LocalLLaMA community 1mo ago

Fulloch V2: 100% Local Voice Assistant for Home Assistant & Obsidian (Runs on 16GB VRAM)

Hey everyone, following up on my r/LocalLLaMA post from a while back, I have spent some time testing how far I can push my 5060ti as a personal voice assistant. The stack is Qwen3.5-9B GGUF Q5_K_M, Qwen3-1.7B ASR, and Qwen3-1.7B TTS, delivering fast, real-time responses with…

21
r/LocalLLaMA community 1mo ago

made a local voice AI for windows you can talk to in any language. open source, bring your own key

been building this on and off for a while and finally got it to a point where i'm not embarrassed to share it, so here goes. it's called Shadow AI. basically a voice-first AI companion that runs on your own windows machine. you just talk to it and it talks back, no typing…

38
r/LocalLLaMA community 1mo ago

this new Moss tts 1.5 is damn good with voice cloning

https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-v1.5 I prefer this over fish audio s2 pro because fish audio doesn't allow commercial use. Long Cat DiT 3.5 is also another good model. Submitted by /u/9r4n4y

38
Hugging Face Daily Papers research 1mo ago

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Abstract: A novel convex optimization framework for language detection in spoken dialogue systems that achieves high accuracy with efficient training and theoretical guarantees against dialectal variations under low-resource conditions. AI-generated summary: Globalization and…

21
r/LocalLLaMA community 1mo ago

We gave a Reachy Mini a real-time voice brain

We attended an event the other day and found this little guy lying on our desk, a Reachy Mini from Hugging Face. It belongs to the daughter of the event organizer. We got curious about how it worked, and an hour later we'd given it a brain. The model basically becomes Reachy. It…

19
Hugging Face Daily Papers research 1mo ago

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Abstract: ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models. AI-generated summary: We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals…

15
arXiv — Machine Learning research 1mo ago

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

arXiv:2605.29543v1 Announce Type: new Abstract: Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This…

10
arXiv — Machine Learning research 1mo ago

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

arXiv:2605.29659v1 Announce Type: new Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail…

4
