Tag

Voice

365 articles archived under #voice

Smol AI News news-outlet 1mo ago

GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

**OpenAI** released **GPT-Realtime-2**, a voice model with **GPT-5-class reasoning**, tool use, interruption handling, and extended context windows up to **128K tokens**, achieving top scores on **Big Bench Audio** and **Conversational Dynamics** benchmarks. They also launched a…

22
OpenAI news 1mo ago

How OpenAI delivers low-latency voice AI at scale

How OpenAI rebuilt its WebRTC stack to power real-time Voice AI with low latency, global scale, and seamless conversational turn-taking.

26
vLLM releases dev-tools 2mo ago

v0.19.2rc0: [Bugfix] Fix k_proj's bias for GLM-ASR (#40160)

Signed-off-by: Rishapveer Singh

4
NVIDIA Developer Blog official-blog 3mo ago

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

38
NVIDIA Developer Blog official-blog 3mo ago

Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale,...

37
Smol AI News news-outlet 3mo ago

not much happened today

**Google** launched **Gemini 3.1 Flash Live**, a realtime voice and vision agent model with **2x longer conversation memory**, supporting **70 languages** and **128k context**. **Mistral AI** released **Voxtral TTS**, a low-latency, open-weight text-to-speech model supporting…

31
Hugging Face official-blog 3mo ago

A New Framework for Evaluating Voice Agents (EVA)

Published March 24, 2026. By Tara Bogavelli, Gabrielle Gauthier Melancon, Katrina Stankiewicz (ServiceNow-AI), Nifemi Bamgbose…

7
OpenAI Python SDK releases dev-tools 3mo ago

v2.28.0

2.28.0 (2026-03-13). Full Changelog: v2.27.0...v2.28.0. Features: api: custom voices (50dc060)

12
ThursdAI news-outlet 5mo ago

📆 ThursdAI - Jan 22 - Clawdbot deep dive, GLM 4.7 Flash, Anthropic constitution + 3 new TTS models

From Weights & Biases: a deep dive into Clawdbot, a personal AI assistant that learns and evolves, plus GLM 4.7 Flash, a bunch of new TTS models, and Claude's new constitution!

29
Google DeepMind official-blog 6mo ago

Improved Gemini audio models for powerful voice experiences

By Bibo Xu (Director of Product Management) and Tara Sainath (Distinguished Research Scientist). Google enhanced Gemini 2.5 Flash Native Audio for better live voice agents. Expect…

37
Hugging Face official-blog 7mo ago

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Published November 21, 2025. By Eric Bezzam, Steven Zheng, Eustache Le Bihan, and Vaibhav Srivastav. While everyone (and their…

30
Hugging Face official-blog 8mo ago

Voice Cloning with Consent

Published October 28, 2025. By Margaret Mitchell and Lucie-Aimée Kaffee. In this blog post, we introduce the idea of a 'voice consent gate' to support voice cloning with consent. We provide an example…

24
Nonint (James Betker) research 25mo ago

GPT-4o

I’m very pleased to show the world GPT-4o. I came into the project mid-last year with Alexis Conneau with the goal of scaling up speech models and building an “AudioLM”. We knew we had something special late last year, but I don’t think either of us…

22
Eugene Yan research 27mo ago

Building an AI Coach to Help Tame My Monkey Mind

Building an AI coach with speech-to-text, text-to-speech, an LLM, and a virtual number.

20
Chip Huyen research 33mo ago

Multimodality and Large Multimodal Models (LMMs)

For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read, talk, and…

10
