r/LocalLLaMA · June 10, 2026 · 1 min read

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single model (TTS still happens afterward).

Works great with a minimal prompt — the model clearly hears and responds to the audio. But once the text prompt gets large/dense (mine is ~21k tokens: detailed instructions + tool definitions), it basically stops attending to the audio — replies as if the audio weren't there (generic/hallucinated) or only weakly transcribes. Trim the prompt back down and audio attention returns.

Same behavior across three stacks, so it doesn't look stack-specific:

- vLLM (gemma4-unified image + pip install av), audio as base64 audio_url

- llama.cpp (--mmproj, input_audio content, chat_template_kwargs {enable_thinking:false})

- LiteRT-LM (gemma4-12b,gpu)

Feels like an inherent attention/saturation limit when audio competes with a long dense text context. (Notably, E4B with a tiny prompt keeps audio attention fine — so I'm using it as a small audio front-end instead.)

Questions for anyone who's tried:

1. Has anyone gotten 12B unified audio to reliably attend to speech with a big system prompt (lots of instructions/tools)?

Known limitation of the unified arch, or a serving/config thing (audio placement in the sequence, attention settings, chat template, sampling)?
Workarounds — audio-first vs audio-last ordering, prompt structuring, attention/RoPE tweaks?

Served on an NVIDIA GB10 (Blackwell).

submitted by /u/Think_Illustrator188
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA