Music
181 articles archived under #music

Hugging Face Daily Papers · research · 1mo ago
Stage-adaptive Token Selection for Efficient Omni-modal LLMs
Abstract: SEATS is a training-free, stage-adaptive token selection method that reduces computational overhead in om-LLMs by progressively pruning redundant visual and audio tokens during both pre-LLM and LLM stages. AI-generated summary: Omni-modal large language models (om-LLMs)… 15

Hugging Face Daily Papers · research · 1mo ago
When Vision Speaks for Sound
Abstract: Video-capable multimodal large language models exhibit apparent audio understanding driven by visual cues rather than actual audio processing, necessitating intervention-based frameworks for diagnosing and improving audio-visual alignment. AI-generated summary: Despite… 34

arXiv — Machine Learning · research · 1mo ago
Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition
arXiv:2605.18884v1 Announce Type: new Abstract: Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states.
Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories… 15

arXiv — NLP / Computation & Language · research · 1mo ago
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
arXiv:2605.19833v1 Announce Type: cross Abstract: Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic… 37

r/LocalLLaMA · community · 1mo ago
LM Studio finally added support for MTP Speculative Decoding
https://preview.redd.it/1uuzjm0ll72h1.png?width=923&format=png&auto=webp&s=1af7d7594be1e08ff7ad6797e2bc53e9410769a3 update to 0.4.14 Build 2 (Beta) and make sure your llama.cpp engine is 2.15.0… 38

Hugging Face Daily Papers · research · 1mo ago
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
Abstract: MSAVBench presents the first comprehensive benchmark and adaptive evaluation framework for multi-shot audio-video generation, addressing limitations in existing benchmarks through diverse task settings and advanced evaluation mechanisms. AI-generated summary: Video… 25

Hugging Face Daily Papers · research · 1mo ago
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
Abstract: OmniGUI presents a novel multimodal benchmark for GUI agents that incorporates simultaneous audio, video, and image inputs to better simulate real smartphone interactions. AI-generated summary: Current benchmarks for graphical user interface (GUI) agents predominantly… 9

r/LocalLLaMA · community · 1mo ago
Why is LM-Studio download page showing me 0.4.7 to download when the latest version is 0.4.13?
I'm currently running LM-Studio 0.4.12. In the app if I check for updates it says there's a new version (0.4.13), I can read the changelog for 0.4.13, but when I go to https://lmstudio.ai/download it shows 0.4.7.
What's going on here? Anyone knows? … 37

TechCrunch — AI · news-outlet · 1mo ago
Google takes a page out of Meta’s book, announces new audio-powered smart glasses
Google is calling the new devices "audio glasses," in that users will be able to issue verbal commands to them and get things done via its ecosystem of apps and services, including Gemini. 7

TechCrunch — AI · news-outlet · 1mo ago
Google takes a page out of Meta’s book, announces new audio-powered smart glasses at IO 2026
Google is calling the new devices "audio glasses," in that users will be able to issue verbal commands to them and get things done via its ecosystem of apps and services, including Gemini. 4

TechCrunch — AI · news-outlet · 1mo ago
Google’s AI Studio now lets anyone build Android apps in minutes
Google unveiled new web-based AI tools that can generate native Android apps in minutes, as the company expands its push into AI-powered software development. 34

TechCrunch — AI · news-outlet · 1mo ago
Google’s Gemini Omni turns images, audio, and text into video — and that’s just the start
Google's Gemini Omni is a new multimodal model that reasons across text, images, audio, and video to generate and edit videos through simple conversation — starting with Omni Flash. 4

r/LocalLLaMA · community · 1mo ago
Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates
Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with voiceflow.com (the chatbot SaaS, name collision, sorry). Why this exists: I wanted local-only dictation and meeting transcription, because audio shouldn't have to leave the machine just to become… 13

r/LocalLLaMA · community · 1mo ago
Audio upscaling, cleanup, or improvement models?
I never see this type of model talked about. Are there many open models in the category? I do a lot of audio cleanup and end up using auphonic but would like to be using a local model.
Edit: e.g., voice recovery, reverb removal, auto-EQ type stuff … 5

arXiv — NLP / Computation & Language · research · 1mo ago
Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech
arXiv:2605.17652v1 Announce Type: new Abstract: There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows… 10

r/MachineLearning · community · 1mo ago
Architecture advice: Real-time pipeline for YouTube Audio -> Whisper -> LLM -> SSE (Sub-10s latency) [D]
Hey everyone, I’m building a backend that analyzes long YouTube videos using an LLM. Currently, my flow is a slow waterfall: Download full audio -> Whisper -> LLM -> Return results. For a 30-minute video, the user waits forever. I want to pipeline this for real-time SSE… 5

Hugging Face Daily Papers · research · 1mo ago
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
Abstract: AuralSAM2 integrates audio into SAM2 through an AuralFuser module that generates sparse and dense prompts, enhancing cross-modal influence while maintaining interactive segmentation efficiency. AI-generated summary: Segment Anything Model 2 (SAM2) exhibits strong… 18

arXiv — NLP / Computation & Language · research · 1mo ago
Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
arXiv:2505.18853v2 Announce Type: replace Abstract: Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion… 23

r/LocalLLaMA · community · 1mo ago
Looking to migrate off of Ollama and LMStudio
Hello, I'm currently using Ollama / lm studio for things like code inference and proofreading emails, etc. Definitely not experienced in this space but looking to grow.
It's been working great but it's a bit slow at times. I use Gemma 4 / Qwen, I also recently tried using… 22

r/LocalLLaMA · community · 1mo ago
GitHub - richardr1126/openreader: An open-source read-along document reader server with high-quality TTS options, synchronized highlighting, and audiobook export for EPUB, PDF, DOCX, TXT, and MD.
Sharing my latest release of OpenReader v3.0.0, an open-source text-to-speech document reader and audiobook exporter. It has been live for over a year now, and has slowly gained 300+ GitHub stars. What is OpenReader? A Next.js web app for reading and listening to EPUB, PDF, TXT,… 9

r/LocalLLaMA · community · 1mo ago
Audio input not accepted with llamacpp for Nemotron 3 nano Omni?
Llama-server does not accept audio input (or video for that matter) with Nemotron 3 nano omni (unsloth). I’m on a recent build of llamacpp and I redownloaded Nemotron, and I have the mmproj loaded too. It still accepts images, but not audio, in fact the audio input option on the… 35

llama.cpp releases · dev-tools · 1mo ago
b9169
mtmd: add chunks and fix preproc for qwen3a (#23073) mtmd: add chunks and fix preproc for qwen3a add attn_mask limit mtmd_chunk size (avoid blow up memory) correct audio tokens re-order the set_input case remove attn_mask macOS/iOS: macOS Apple Silicon (arm64) macOS Apple… 7

r/LocalLLaMA · community · 1mo ago
Adding E4B audio encoder to larger models
I am curious if anyone here has tried doing this. I did a bit of digging and it seems like it would be easier to do than I first thought, and I would like to ask for correction if my assumptions are wrong. Here is how I would go about it: Extract the 300mb audio encoder from… 22

r/LocalLLaMA · community · 1mo ago
Qwen 3.6 27B: IQ3XXS KV Q8 vs Q4XL KV Q4 (262K context)
hey yall. So I have a 24GB gpu. What do you think is better? I am using unsloth quants. Both are UD quants. I need 262K context for my hermes agent and use case. Both setups fit perfectly in vram.
I have heard that Qwen 3.6 27B is quite good even with Q4 KV. I am using LM studio… 27

arXiv — Machine Learning · research · 1mo ago
AudioMosaic: Contrastive Masked Audio Representation Learning
arXiv:2605.14231v1 Announce Type: new Abstract: Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches… 11

arXiv — NLP / Computation & Language · research · 1mo ago
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
arXiv:2605.15104v1 Announce Type: new Abstract: Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling… 19

r/LocalLLaMA · community · 1mo ago
Llama-Studio, WebUI for llama-server Management
Hey all, I have built myself a WebUI for configuring and managing llama-server sessions, and want to share the code and concept. Python and a bit of JS. Hack away! Local only. https://github.com/m94301/llama-studio The major use case is running various instances of llama-server… 11

r/LocalLLaMA · community · 1mo ago
Scenema Audio: Zero-shot expressive voice cloning and speech generation
We've been building Scenema Audio as part of our video production platform at scenema.ai, and we're releasing the model weights and inference code. The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage,… 17

Hugging Face Daily Papers · research · 1mo ago
Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition
Abstract: Research identifies studio-bias in multilingual ASR fine-tuning and proposes R-MFT method to improve spontaneous speech performance while maintaining efficiency.
AI-generated summary: Fine-tuning multilingual ASR models like Whisper for low-resource languages often… 20

r/LocalLLaMA · community · 1mo ago
running Qwen 3.6 35b A3B on 2x 5060TI
i ran Qwen 3.6 35b A3B on two 5060TI 16gb (32gb vram, also i have 32gb dram but i don't like offloading). i used Q4 on LM Studio to get full context and i get 90t/s. any tricks to optimize this more to upgrade to Q6 or Q8? thanks! another thing, if you recommend something for… 11

r/MachineLearning · community · 1mo ago
Scenema Audio: Zero-shot expressive voice cloning and speech generation [N]
We've been building Scenema Audio as part of our video production platform at scenema.ai, and we're releasing the model weights and inference code. The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage,… 37