Tag

Music

181 articles archived under #music

Hugging Face Daily Papers research 28d ago

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Abstract StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage…

8
Smol AI News news-outlet 29d ago

not much happened today

**NVIDIA** led open-source AI model releases with **Cosmos 3**, a comprehensive omnimodal world model unifying language, image, video, audio, and action using a Mixture-of-Transformers design, and **Nemotron 3 Ultra**, a **550B** parameter open-weight model noted for high…

33
Hugging Face Daily Papers research 29d ago

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Abstract SwanSphere presents a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies. AI-generated summary Real-time and accurate spatial…

25
r/LocalLLaMA community 29d ago

Llama Studio v0.2.0

I have made an update to my llama-server WebUI based on some awesome feedback and interaction with the community. 1) JSON model config replaced by per-model shell scripts. Run from CLI, paste from unsloth, email to your buddy or post to reddit: Using real shell scripts to store…

17
r/LocalLLaMA community 29d ago

<Think> toggle button for llama.cp web chat for QWEN3.6

Missing a button in llama-serve webchat to toggle reasoning on/off like in LM Studio? This is a snippet that runs in https://www.tampermonkey.net/ a browser…

34
r/LocalLLaMA community 1mo ago

Open source : Turning vocal imitations into sound effects. (New UX for sound generation)

Hello guys I want to introduce my new project! Have you ever needed a specific sound while making a video or a game? You know exactly what it sounds like in your head, but have no idea how to search for it. That’s why sound design meetings at game studios often turn into people…

12
r/LocalLLaMA community 1mo ago

this new Moss tts 1.5 is damn good with voice cloning

https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-v1.5 I prefer this over fish audio s2 pro because fish audio doesn't allow commercial use. Long Cat DiT 3.5 is also another good model. Submitted by /u/9r4n4y.

38
r/LocalLLaMA community 1mo ago

I compared all specs of the major GPUs/machines that are being used here, because bandwidth is not everything. Some of ya'll need a reality check.

Hot takes: - Mac studio is overpriced Raspberry Pi that is way more inefficient than people think (together with most macs). M5 MBP is better with the "tensor" MMA, but not by much. - Spark was actually decent when it was just 3-4k. Strix is obviously much better now - 3090 are…

26
r/LocalLLaMA community 1mo ago

Unsloth Studio updated to support training with MLX on macs

The title says it all. I noticed this morning when reviewing Unsloth Studio github that training with MLX is now fully supported. Not sure when this was added but must have been within the last couple of weeks since last I checked it said "coming soon." I haven't personally…

36
Hugging Face Daily Papers research 1mo ago

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Abstract ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models. AI-generated summary We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals…

15
arXiv — Machine Learning research 1mo ago

Auditing Training Data in Generative Music Models via Black-Box Membership Inference

arXiv:2605.29202v1 Announce Type: new Abstract: Recent advances in text-to-music generation enable high-fidelity synthesis of structured musical audio, raising growing concerns about data provenance, consent, and training transparency. These models are typically trained on…

29
arXiv — NLP / Computation & Language research 1mo ago

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

arXiv:2605.29300v1 Announce Type: new Abstract: Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored.…

8
llama.cpp releases dev-tools 1mo ago

b9393

mtmd: fix gemma 4 audio rms norm eps (#23815). Update tools/mtmd/clip.cpp. Co-authored-by: Sigbjørn Skjæret. macOS/iOS: macOS Apple Silicon (arm64) macOS…

34
Hugging Face Daily Papers research 1mo ago

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Abstract OmniInteract presents a streaming benchmark for real-time omnimodal large language models that evaluates online audio-visual processing with temporal grounding and interactive response requirements. AI-generated summary We introduce OmniInteract, a streaming benchmark…

25
Hugging Face Daily Papers research 1mo ago

Native Audio-Visual Alignment for Generation

Abstract NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary Joint audio-video generation aims to synthesize temporally synchronized and…

38
arXiv — NLP / Computation & Language research 1mo ago

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

arXiv:2605.27741v1 Announce Type: new Abstract: Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural…

38
arXiv — NLP / Computation & Language research 1mo ago

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

arXiv:2605.27984v1 Announce Type: new Abstract: Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment…

10
TechCrunch — AI news-outlet 1mo ago

ElevenLabs’s new music generation model can switch genres mid-track

ElevenLabs' new model will let users regenerate a section of a song without affecting the rest of the track

29
r/LocalLLaMA community 1mo ago

I ran 8 open-weight models as agents in a persistent MMO for 10 days. Here's the 93k event dataset and some things that I learned

Howdy everyone! Quick disclosure: I work on this - it's a project my studio created called the Null Epoch. I wasn't really happy with testing my agents with the usual static benchmarks and I wanted to learn more about how models and agents handle long-horizon planning, resource…

25
r/MachineLearning community 1mo ago

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

If you've ever tried to pick an STT vendor for a phone-based voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, CommonVoice,…

31
Hugging Face Daily Papers research 1mo ago

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Abstract Gemini Embedding 2 is a multimodal embedding model that generates unified representations for video, audio, image, and text data, achieving superior performance across diverse retrieval tasks and demonstrating strong zero-shot capabilities across specialized domains.…

18
arXiv — NLP / Computation & Language research 1mo ago

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

arXiv:2605.26978v1 Announce Type: new Abstract: Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target…

13
arXiv — NLP / Computation & Language research 1mo ago

Learning When to Think While Listening in Large Audio-Language Models

arXiv:2605.27190v1 Announce Type: new Abstract: Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until…

12
Hugging Face Daily Papers research 1mo ago

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Abstract LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary Audio-visual generation is rapidly advancing…

32
Hugging Face official-blog 1mo ago

Reachy Mini goes fully local

Published May 27, 2026. By Amir Mahla and Andres Marafioti. After building your Reachy Mini, you'll install the conversation app and start talking to it. Until now, you had to send your audio to a…

20
r/LocalLLaMA community 1mo ago

Llamacpp server : How do the -np and -c flags interact?

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact. The context for each parallel client appears to be equally distributed…

10
arXiv — NLP / Computation & Language research 1mo ago

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

arXiv:2605.23954v1 Announce Type: new Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement,…

36
arXiv — NLP / Computation & Language research 1mo ago

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

arXiv:2605.23975v1 Announce Type: new Abstract: Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission,…

14
arXiv — NLP / Computation & Language research 1mo ago

Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

arXiv:2605.25179v1 Announce Type: new Abstract: Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token…

12
Hugging Face Daily Papers research 1mo ago

StepAudio 2.5 Technical Report

Abstract StepAudio 2.5 is a unified audio-language model that matches specialized systems in ASR, TTS, and real-time spoken interaction by using task-tailored reinforcement learning from human feedback to optimize shared representations across different operational modes.…

12
r/LocalLLaMA community 1mo ago

Qwen 3.6 27B MTP speed on 3080ti (getting 4.5 t/s)

Using LM Studio with a 3080 Ti (12 GB of VRAM) and 128 GB of DDR4. Model version: Qwen 3.6 27B MTP UD q4_k_xl. Is this my hardware limit? Is there any way to speed this up using the current hardware? Submitted by /u/yehiaserag.

35
r/LocalLLaMA community 1mo ago

qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

I have this old 10-year old Dell T5810 workstation with 32GB ddr3(?) memory and a E5-2698v3 (16 cores 32 threads), a GTX 1060 6GB that's used for mining back in the old days (paid itself back many times over). I managed to get the model running with LMStudio in Windows(!). My…

13
Hacker News — AI on Front Page community 1mo ago

Show HN: Audiomass – a free, open-source multitrack audio editor for the web

Article URL: https://audiomass.co/?multitrack=1 Comments URL: https://news.ycombinator.com/item?id=48258015 Points: 338 # Comments: 68

29
Hacker News — AI on Front Page community 1mo ago

BambuStudio has been violating PrusaSlicer AGPL license since their fork

Article URL: https://xcancel.com/josefprusa/status/2054602354851254330 Comments URL: https://news.ycombinator.com/item?id=48245862 Points: 249 # Comments: 96

5
r/LocalLLaMA community 1mo ago

meituan-longcat/LongCat-Video-Avatar-1.5 · Hugging Face

🚀 Model Introduction We are excited to announce the release of LongCat-Video-Avatar 1.5, an upgraded open-source framework that prioritizes extreme empirical optimization and production-readiness for audio-driven human video generation. Built upon the LongCat-Video foundation…

21
Ars Technica — AI news-outlet 1mo ago

US scrambles to stop Internet users re-creating dead pilots’ voices

Workaround flouts law that bans NTSB disclosures of cockpit audio recordings.

13
Hugging Face Daily Papers research 1mo ago

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Abstract Audio diffusion models are adapted for interactive music generation through efficient block-wise processing and novel training paradigms that enable real-time performance on consumer hardware. AI-generated summary Interactive streaming music generation promises the use…

11
Hugging Face Daily Papers research 1mo ago

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Abstract LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states using feature-level supervision and temporal consistency embedding, outperforming explicit text-based chain-of-thought approaches in audio-visual reasoning…

18
r/MachineLearning community 1mo ago

Live Human Detector on Outbound Phone Calls [R]

Goal: To save humans wasting time sitting in call centre queues waiting to be answered. To have a tool listen in on the audio stream of a live call, post IVR navigation, to determine whether the call has transitioned out of the queue and to a live person. Requirements: The tool must…

20
arXiv — NLP / Computation & Language research 1mo ago

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

arXiv:2605.22012v1 Announce Type: new Abstract: Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is…

7
llama.cpp releases dev-tools 1mo ago

b9279

vulkan: fuse snake activation (mul, sin, sqr, mul, add) ( #22855 ) vulkan: fuse snake activation (mul, sin, sqr, mul, add) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition emitted by audio…

23
r/LocalLLaMA community 1mo ago

In theory, if I have $20k-ish to spend on hardware what would actually get me closest to local coding agent that would allow me to go totally off the social grid?

Let's say I'm in the market to buy a studio or RTX 6000s. At what point am I off the grid with a local coding agent? Probably a model question too. Submitted by /u/Tired__Dev.

17
LangChain releases dev-tools 1mo ago

langchain-openai==1.2.2

Changes since langchain-openai==1.2.1 release(openai): 1.2.2 ( #37617 ) chore(infra): bump langchain-tests floor to 1.1.9 ( #37610 ) test(openai): unbreak audio chat and Azure embedding integration tests ( #37589 ) fix(openai): guard httpx finalizers ( #37570 ) chore: bump…

9
r/LocalLLaMA community 1mo ago

Best solution to generate reports locally with graphs, charts? Beginner question.

So on a local lm like ollama, or lm studio etc. you can run questions and prompts. But it’s a text response and I am unable to have it generate pdfs or report files graphs. Such as a pie chart on my invoices. Or create a report for me on statistics. When I run kimi, or Claude…

11
r/LocalLLaMA community 1mo ago

Strix Halo 128GB vs M5 pro 64GB

What would you pick if they were at the same/similar price, say around $3000 (Macbook pro 16" vs laptop at a little more or even Mini PC at a little less like $2500). Has someone tried both in terms of speed? I use LM studio. I tend to prefer MacOS because of Drawthings, which…

18
TechCrunch — AI news-outlet 1mo ago

Spotify launches an ElevenLabs-powered audiobook creation tool

Spotify is releasing new audiobook plans later this year

20
Hugging Face Daily Papers research 1mo ago

Stable Audio 3

Abstract Stable Audio 3 enables efficient variable-length audio generation and editing through latent diffusion models operating on a semantic-acoustic autoencoder, with adversarial post-training for improved speed and quality. AI-generated summary Stable Audio 3 is a family of…

19
Hugging Face Daily Papers research 1mo ago

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Abstract Large Audio Language Models exhibit significant trustworthiness challenges despite performance advances, requiring comprehensive frameworks addressing security vulnerabilities and defensive strategies. AI-generated summary The foundational capabilities established by…

31
TechCrunch — AI news-outlet 1mo ago

Stability AI releases a new audio model that can create six-minute songs

Stability Audio 3.0 small model can run on-device and generate two-minute long tracks

21
r/LocalLLaMA community 1mo ago

"AWS secures rare Mac Studios while ordinary Apple customers remain completely locked out"

https://www.techradar.com/pro/you-cant-buy-them-for-your-home-or-office-but-aws-just-snapped-up-a-host-of-apples-most-highly-desired-m3-ultra-macs Let them eat cloud! Submitted by /u/openSourcerer9000.

20
