Tag

Music

181 articles archived under #music · RSS

r/LocalLLaMA community 12h ago

What's the full local AI "doomsday prepper" kit for cold storage? 16-bit safetensors of LLMs (obv), copies/source codes of Llama.cpp, ComfyUI, vLLM, Kobold, LMStudio, etc, macOS, Linux OSes, Windows 10&11, etc, Rufus (including older ones), various VMs, P-E-W's Heretic/Grimoire,…

For those who want to be as paranoid and maximally doomsday prepped as possible, I am curious what the most thorough "doomsday kit" is of things to store offline copies of "just in case", to still be able to use local AI if things go truly crazy to a super extreme level. So far…

23
Vercel — AI dev-tools 23h ago

Build realtime voice agents on AI Gateway

AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI. Each…

26
arXiv — Machine Learning research 1d ago

HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models

arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance…

7
arXiv — Machine Learning research 1d ago

A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset

arXiv:2606.27886v1 Announce Type: new Abstract: Recent advances in Human Activity Recognition (HAR) from wearable sensors have shown that multi-modal deep learning models consistently outperform their uni-modal counterparts. Modalities can include IMUs, RGB cameras, audio…

27
arXiv — Machine Learning research 1d ago

Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding

arXiv:2606.27320v1 Announce Type: cross Abstract: Neural audio autoencoders have become a core component of compression, feature extraction, and generation. However, while existing systems support variable bitrate, the vast majority of models still operate at a fixed latent…

38
Vercel — AI dev-tools 1d ago

Realtime voice, speech, and transcription now supported on AI Gateway

AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text. This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with…

17
Vercel — AI dev-tools 1d ago

xAI Grok audio models now available on Vercel AI Gateway

xAI's audio models are now live on AI Gateway. Realtime voice, text to speech, and speech to text are all available through the AI SDK with the same routing, observability, and spend controls as your other models. These capabilities are available on the AI SDK 7 release.…

11
r/LocalLLaMA community 2d ago

Any better models in coding for single dgx spark in near future?

I’m the owner of a single DGX Spark with 128 GB of unified memory, and I’m hosting my ppm across my local network over LM Studio. I’m mainly using it for coding, some long-document sorting tasks, and some security testing. My favorite rn is stepfun step-3.7-flash q3 xxl; it’s a bit…

32
r/LocalLLaMA community 4d ago

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

I’ve been working on audio.cpp, a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything…

24
r/MachineLearning community 4d ago

Looking for arXiv endorsement (eess.AS or cs.SD) [R]

Hi, I'm an undergrad researcher looking for an arXiv endorsement to submit my first paper in the audio/speech processing domain (keyword spotting on microcontrollers). I've submitted to a peer-reviewed IEEE conference and am awaiting results, but want to get a preprint up in the…

26
Vercel — AI dev-tools 4d ago

AI SDK 7 is now available

AI SDK 7 is a major release for building production agents in TypeScript. The SDK has grown from model calls and chat primitives into a broader agent platform for developing, running, integrating, and observing agents across text, audio, realtime, image, and video. Every major…

8
Hugging Face Daily Papers research 5d ago

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Abstract: UnityShots is a memory-driven audio-video generation system that maintains consistent subject appearance and audio across video cuts using fixed-size long-term and short-term memory slots with boundary-conditioned gates and discrete cut-type priors. Generated by…

7
arXiv — NLP / Computation & Language research 5d ago

Robustness assessment of large audio language models in multiple-choice evaluation

arXiv:2510.04584v2 Announce Type: replace Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in…

13
Hugging Face Daily Papers research 5d ago

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Abstract: Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We present…

20
arXiv — NLP / Computation & Language research 6d ago

AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

arXiv:2606.24286v1 Announce Type: new Abstract: Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To…

15
arXiv — NLP / Computation & Language research 6d ago

Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

arXiv:2606.24523v1 Announce Type: new Abstract: Scam phone calls exploit vulnerable communities worldwide, yet research on detection has focused almost exclusively on English and other high-resource languages. In low-resource settings such as Turkish, detection is especially…

11
arXiv — NLP / Computation & Language research 6d ago

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained…

15
r/LocalLLaMA community 6d ago

GLM 5.2 on Mac Studio Speedup PR

Just a heads up for the lucky few 512 gb mac owners: GLM 5.2 is a game changer because prefill speeds stay above 100 t/s at much higher context, and also take less space, so we can run 4 bit quants well above 100k context. See this PR by the oMLX creator:…

5
Hugging Face Daily Papers research 6d ago

Libretto: Giving LLM Agents a Sense of Musical Structure

Abstract: Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Generative music systems can now produce impressive audio from…

18
r/LocalLLaMA community 6d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. Every audio output scored with UTMOS (utmos22_strong) so quality isn't…

19
Hugging Face Daily Papers research 6d ago

Improving Text-to-Music Generation with Human Preference Rewards

Abstract: A text-to-music generation system uses reward conditioning, expert iteration, and preference tuning to improve audio quality while maintaining efficiency within a 120M-parameter model framework. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We describe our entry to the…

19
r/LocalLLaMA community 7d ago

EU AI Act requires TEXT from models and providers to be watermarked 2nd August onwards. Everyone here is affected, regardless where you live.

Anyone hate the cookie banners? Those are absolutely nothing in comparison to what is about to come. The AI Act requires lots of things, many people know it requires every AI modified or generated audiofile to be metadata tagged and fingerprint-watermarked from August on (32M$…

9
r/MachineLearning community 7d ago

Recommendations for speech annotation tools [D]

I'm looking for human-in-the-loop platforms that allow you to automatically transcribe audio followed by manually fixing the transcriptions and fine tuning the model. Is there a local (not an online service) installable platform for doing this?

11
r/LocalLLaMA community 9d ago

Qwen code companion on vscode marketplace - thoughts

I just came across this extension in vscode a few days ago and tried to use it with LM Studio hosted models, and it really is pretty good compared to `continue`, `kilo`, `cline`, `roo`. I felt that, without many tweaks, it gets straight to the point; if any tweaks are required u could do…

36
r/LocalLLaMA community 10d ago

Local agent on 4090 - looking for LM Studio settings

I have moved on from Ollama to just dink around and instead want to start running a local agent from time to time. With the 24GB of a 4090 (Gigabyte OC edition) that should be quite possible. But no matter what settings I use for context and batching, token generation is slow as…

36
r/LocalLLaMA community 10d ago

Single RTX 3090 (MSI TRio) giving trouble on inference.

Hi, I'm having weird issues with my 3090 on inference via LM Studio. It just: unloads the model / the model crashes and the NVIDIA driver resets; freezes the PC; gives a blue/black screen and the computer restarts; or straight up restarts everything. I tried running it regularly,…

33
r/LocalLLaMA community 10d ago

Best Harness for Web Searching

Looking for opinions on the best software to do web searching resources. What I've tried: LM Studio + plugins; Odysseus. I think the problem they're both running into is the search engines they're using max out at like, 10 requests per day/hour or something without an api. I don't…

17
Hugging Face Daily Papers research 10d ago

Duration Aware Scheduling for ASR Serving Under Workload Drift

Abstract: Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by…

26
r/LocalLLaMA community 10d ago

GLM-5.2 can now run locally in llama.cpp and Unsloth Studio.

The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% size). Run on a 256GB Mac or RAM/VRAM setups. GLM-5.2 is the strongest open model to date. Check the graph for the accuracy of each GLM-5.2-GGUF quantization. Full guide:…

35
arXiv — NLP / Computation & Language research 11d ago

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

arXiv:2606.20179v1 Announce Type: new Abstract: Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial…

21
Hugging Face Daily Papers research 11d ago

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Abstract: MaineCoon represents the first real-time audio-visual autoregressive model for social worlds, achieving high frame rates and long-horizon generation through novel training techniques and inference frameworks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) As an increasing…

21
r/LocalLLaMA community 12d ago

I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it?

Yes, I know this is a simple question I could just ask Claude or something, but I want to see what the community suggests. For context, it’s a 16in MacBook Pro, and I use Hermes agent as a harness connected to LM Studio, as obviously it’s preferable to be running MLX models, especially…

4
arXiv — NLP / Computation & Language research 12d ago

Continuous Audio Thinking for Large Audio Language Models

arXiv:2606.18273v1 Announce Type: new Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned…

37
arXiv — NLP / Computation & Language research 12d ago

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge…

35
arXiv — NLP / Computation & Language research 12d ago

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on…

35
Ars Technica — AI news-outlet 12d ago

The Gemini-powered Google Home Speaker arrives on June 25 for $100

Google's new smart speaker is more about Gemini than audio quality.

27
TechCrunch — AI news-outlet 12d ago

DeepL acquires Mixhalo for live-event audio streaming and translation

With this acquisition, DeepL is opening an office in San Francisco to expand its U.S. business.

12
arXiv — NLP / Computation & Language research 13d ago

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

arXiv:2606.17391v1 Announce Type: new Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned,…

12
arXiv — NLP / Computation & Language research 13d ago

ALAS: An Automatic Latent Alignment Score for Audio Language Models

arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio-text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion…

17
r/LocalLLaMA community 13d ago

I didn't know it was possible to compile llamacpp to run cuda + vulkan at the same time..

cmake -B build -G "Visual Studio 17 2022" -A x64 -DCUDAToolkit_ROOT="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" -DGGML_CUDA=ON -DGGML_VULKAN=ON -DGGML_FLASH_ATTN=ON -DGGML_BLAS=OFF -DGGML_NATIVE=OFF -DGGML_RPC=ON -DGGML_BACKEND_DL=ON…

31
Hugging Face Daily Papers research 13d ago

MVEB: Massive Video Embedding Benchmark

Abstract: A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset…

7
arXiv — Machine Learning research 14d ago

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

arXiv:2606.15436v1 Announce Type: new Abstract: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and…

28
arXiv — NLP / Computation & Language research 14d ago

TMASC: Transmasculine Attitude and Speech Corpus

arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the…

25
Hugging Face Daily Papers research 14d ago

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Abstract: A novel open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications through a frozen reward mechanism. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We introduce…

5
r/MachineLearning community 14d ago

Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D]

I'm trying to understand where people doing sensor-based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real world data in the first…

6
r/LocalLLaMA community 14d ago

What do you guys think about Unsloth Studio?

As a person who has gone through more AI frontends than one goes through socks, I have really appreciated the Unsloth frontend. It's anything I could ever need and it supports Diffusion Gemma! It has easy options to enable tensor parallelism and much more. Have you guys tried it…

33
r/LocalLLaMA community 14d ago

I think we need a /LocalHarnessLLM or something ...

LM Studio, Hermes, Qwen Code, Odysseus, Open Claw, Open Code, Claude Code (and then IDEs w/ agentic capabilities), Continue, Rider, VS Code, and a dozen others I'm sure ... Would love a place to discuss these? If not a new subreddit, a new discord section in localllama discord? I've made…

24
arXiv — Machine Learning research 15d ago

Beyond task performance: Decoding bioacoustic embeddings with speech features

arXiv:2606.14662v1 Announce Type: new Abstract: Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species…

6
arXiv — NLP / Computation & Language research 15d ago

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

arXiv:2606.13993v1 Announce Type: new Abstract: A crucial aspect of linguistic capability is the ability to trade off between stored representations and abstract knowledge: one must retrieve learned representations, but also generate novel ones by applying productive rules.…

34
arXiv — NLP / Computation & Language research 15d ago

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

arXiv:2606.14694v1 Announce Type: new Abstract: Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and…

4
