Tag: Music · 180 articles archived under #music
Vercel — AI dev-tools 22h ago Build realtime voice agents on AI Gateway AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI. Each… 26
arXiv — Machine Learning research 1d ago HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance… 7
arXiv — Machine Learning research 1d ago A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset arXiv:2606.27886v1 Announce Type: new Abstract: Recent advances in Human Activity Recognition (HAR) from wearable sensors have shown that multi-modal deep learning models consistently outperform their uni-modal counterparts. Modalities can include IMUs, RGB cameras, audio… 27
arXiv — Machine Learning research 1d ago Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding arXiv:2606.27320v1 Announce Type: cross Abstract: Neural audio autoencoders have become a core component of compression, feature extraction, and generation. However, while existing systems support variable bitrate, the vast majority of models still operate at a fixed latent… 38
Vercel — AI dev-tools 1d ago Realtime voice, speech, and transcription now supported on AI Gateway AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text.
This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with… 17
Vercel — AI dev-tools 1d ago xAI Grok audio models now available on Vercel AI Gateway xAI's audio models are now live on AI Gateway. Realtime voice, text to speech, and speech to text are all available through the AI SDK with the same routing, observability, and spend controls as your other models. These capabilities are available on the AI SDK 7 release.… 11
r/LocalLLaMA community 2d ago Any better models in coding for single dgx spark in near future? I’m an owner of single dgx spark with 128 gb unified memory. and I’m hosting through all my local network my ppm over lmstudio. I’m mainly using it for coding, some long document sorting tasks and some security testing. my favorite rn is stepfun step-3.7-flash q3 xxl it’s a bit… 32
r/LocalLLaMA community 4d ago audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA I’ve been working on audio.cpp, a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything… 24
r/MachineLearning community 4d ago Looking for arXiv endorsement (eess.AS or cs.SD) [R] Hi, I'm an undergrad researcher looking for an arXiv endorsement to submit my first paper in the audio/speech processing domain (keyword spotting on microcontrollers). I've submitted to a peer-reviewed IEEE conference and am awaiting results, but want to get a preprint up in the… 26
Vercel — AI dev-tools 4d ago AI SDK 7 is now available AI SDK 7 is a major release for building production agents in TypeScript.
The SDK has grown from model calls and chat primitives into a broader agent platform for developing, running, integrating, and observing agents across text, audio, realtime, image, and video. Every major… 8
Hugging Face Daily Papers research 5d ago UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating Abstract UnityShots is a memory-driven audio-video generation system that maintains consistent subject appearance and audio across video cuts using fixed-size long-term and short-term memory slots with boundary-conditioned gates and discrete cut-type priors. Generated by… 7
arXiv — NLP / Computation & Language research 5d ago Robustness assessment of large audio language models in multiple-choice evaluation arXiv:2510.04584v2 Announce Type: replace Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in… 13
Hugging Face Daily Papers research 5d ago Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models Abstract Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present… 20
arXiv — NLP / Computation & Language research 6d ago AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression arXiv:2606.24286v1 Announce Type: new Abstract: Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy.
To… 15
arXiv — NLP / Computation & Language research 6d ago Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams arXiv:2606.24523v1 Announce Type: new Abstract: Scam phone calls exploit vulnerable communities worldwide, yet research on detection has focused almost exclusively on English and other high-resource languages. In low-resource settings such as Turkish, detection is especially… 11
arXiv — NLP / Computation & Language research 6d ago ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained… 15
r/LocalLLaMA community 6d ago GLM 5.2 on Mac Studio Speedup PR Just a heads up for the lucky few 512 gb mac owners: GLM 5.2 is a game changer because prefill speeds stay above 100 t/s at much higher context, and also take less space, so we can run 4 bit quants well above 100k context. See this PR by the oMLX creator:… 5
Hugging Face Daily Papers research 6d ago Libretto: Giving LLM Agents a Sense of Musical Structure Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from… 18
r/LocalLLaMA community 6d ago CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total.
Every audio output scored with UTMOS (utmos22_strong) so quality isn't… 19
Hugging Face Daily Papers research 6d ago Improving Text-to-Music Generation with Human Preference Rewards Abstract A text-to-music generation system uses reward conditioning, expert iteration, and preference tuning to improve audio quality while maintaining efficiency within a 120M-parameter model framework. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We describe our entry to the… 19
r/LocalLLaMA community 7d ago EU AI Act requires TEXT from models and providers to be watermarked 2nd August onwards. Everyone here is affected, regardless where you live. Anyone hate the cookie banners? Those are absolutely nothing in comparison to what is about to come. The AI Act requires lots of things, many people know it requires every AI modified or generated audio file to be metadata tagged and fingerprint-watermarked from August on (32M$… 9
r/MachineLearning community 7d ago Recommendations for speech annotation tools [D] I'm looking for human-in-the-loop platforms that allow you to automatically transcribe audio followed by manually fixing the transcriptions and fine tuning the model. Is there a local (not an online service) installable platform for doing this? … 11
r/LocalLLaMA community 9d ago Qwen code companion on vscode marketplace - thoughts I just came across this extension in vscode few days ago and tried to use with LM studio hosted models and it really is pretty good compared to `continue`, `kilo`, `cline`, `roo` like I felt without much tweaks, gets straight to the point, if any tweaks required u could do… 36
r/LocalLLaMA community 10d ago Local agent on 4090 - looking for LM Studio settings I have moved on from Ollama to just dink around and instead want to start running a local agent from time to time. With the 24GB of a 4090 (Gigabyte OC edition) that should be quite possible.
But no matter what settings I use for context and batching, token generation is slow as… 36
r/LocalLLaMA community 10d ago Single RTX 3090 (MSI TRio) giving trouble on inference. Hi, I'm having weird issues with my 3090 on inference via lmstudio, it just: unloads the model / model crashes + nvidia driver resets; freezes the pc; gives blue/black screen and the computer restarts; or straight up restarts everything. I tried running it regularly,… 33
r/LocalLLaMA community 10d ago Best Harness for Web Searching Looking for opinions on the best software to do web searching resources. What I've tried: LM Studio + plugins, Odysseus. I think the problem they're both running into is the search engines they're using max out at like, 10 requests per day/hour or something without an api. I don't… 17
Hugging Face Daily Papers research 10d ago Duration Aware Scheduling for ASR Serving Under Workload Drift Abstract Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by… 26
r/LocalLLaMA community 10d ago GLM-5.2 can now run locally in llama.cpp and Unsloth Studio. The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% size). Run on a 256GB Mac or RAM/VRAM setups. GLM-5.2 is the strongest open model to date. Check the graph for the accuracy of each GLM-5.2-GGUF quantization.
Full guide:… 35
arXiv — NLP / Computation & Language research 11d ago ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion arXiv:2606.20179v1 Announce Type: new Abstract: Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial… 21
Hugging Face Daily Papers research 11d ago MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model Abstract MaineCoon represents the first real-time audio-visual autoregressive model for social worlds, achieving high frame rates and long-horizon generation through novel training techniques and inference frameworks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As an increasing… 21
r/LocalLLaMA community 11d ago I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it? Yes I know this is a simple question I could just ask Claude or something but I want to see what the community suggests. For context it’s a 16in MacBook Pro and I use Hermes agent as a harness connected to LM studio as obviously it’s preferable to be running MLX models especially… 4
arXiv — NLP / Computation & Language research 12d ago Continuous Audio Thinking for Large Audio Language Models arXiv:2606.18273v1 Announce Type: new Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned… 37
arXiv — NLP / Computation & Language research 12d ago IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists.
However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge… 35
arXiv — NLP / Computation & Language research 12d ago FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on… 35
Ars Technica — AI news-outlet 12d ago The Gemini-powered Google Home Speaker arrives on June 25 for $100 Google's new smart speaker is more about Gemini than audio quality. 27
TechCrunch — AI news-outlet 12d ago DeepL acquires Mixhalo for live-event audio streaming and translation With this acquisition, DeepL is opening an office in San Francisco to expand its U.S. business. 12
arXiv — NLP / Computation & Language research 13d ago NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama arXiv:2606.17391v1 Announce Type: new Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned,… 12
arXiv — NLP / Computation & Language research 13d ago ALAS: An Automatic Latent Alignment Score for Audio Language Models arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion… 17
r/LocalLLaMA community 13d ago I didn't know it was possible to compile llamacpp to run cuda + vulkan at the same time..
cmake -B build -G "Visual Studio 17 2022" -A x64 -DCUDAToolkit_ROOT="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" -DGGML_CUDA=ON -DGGML_VULKAN=ON -DGGML_FLASH_ATTN=ON -DGGML_BLAS=OFF -DGGML_NATIVE=OFF -DGGML_RPC=ON -DGGML_BACKEND_DL=ON… 31
Hugging Face Daily Papers research 13d ago MVEB: Massive Video Embedding Benchmark Abstract A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset… 7
arXiv — Machine Learning research 14d ago Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models arXiv:2606.15436v1 Announce Type: new Abstract: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and… 28
arXiv — NLP / Computation & Language research 14d ago TMASC: Transmasculine Attitude and Speech Corpus arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the… 25
Hugging Face Daily Papers research 14d ago TuneJury: An Open Metric for Improving Music Generation Preference Alignment Abstract A novel open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications through a frozen reward mechanism. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce… 5
r/MachineLearning community 14d ago Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)?
[D] I'm trying to understand where people doing sensor based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real world data in the first… 6
r/LocalLLaMA community 14d ago What do you guys think about Unsloth Studio? As a person who has gone through more AI frontend than one goes through socks, I have really appreciated the Unsloth frontend. It's anything I could ever need and it supports Diffusion Gemma! It has easy options to enable tensor parallelism and much more. Have you guys tried it… 33
r/LocalLLaMA community 14d ago I think we need a /LocalHarnessLLM or something ... LM Studio, Hermes, Qwen Code, Odysseus, Open Claw, Open Code, Claude Code (and then IDEs w/ agentic capabilities): Continue, Rider, VS Code. And a dozen others I'm sure ... Would love a place to discuss these? If not a new subreddit, a new discord section in localllama discord? I've made… 24
arXiv — Machine Learning research 15d ago Beyond task performance: Decoding bioacoustic embeddings with speech features arXiv:2606.14662v1 Announce Type: new Abstract: Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task.
This hinders transparency and limits extension to rare species… 6
arXiv — NLP / Computation & Language research 15d ago The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models arXiv:2606.13993v1 Announce Type: new Abstract: A crucial aspect of linguistic capability is the ability to trade off between stored representations and abstract knowledge: one must retrieve learned representations, but also generate novel ones by applying productive rules.… 34
arXiv — NLP / Computation & Language research 15d ago AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization arXiv:2606.14694v1 Announce Type: new Abstract: Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and… 4
arXiv — NLP / Computation & Language research 15d ago Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources arXiv:2606.14141v1 Announce Type: cross Abstract: Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source… 12