Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single model (TTS still happens afterward).
Works great with a minimal prompt — the model clearly hears and responds to the audio. But once the text prompt gets large/dense (mine is ~21k tokens: detailed instructions + tool definitions), it basically stops attending to the audio — replies as if the audio weren't there (generic/hallucinated) or only weakly transcribes. Trim the prompt back down and audio attention returns.
Same behavior across three stacks, so it doesn't look stack-specific:
- vLLM (gemma4-unified image + pip install av), audio as base64 audio_url
- llama.cpp (--mmproj, input_audio content, chat_template_kwargs {enable_thinking:false})
- LiteRT-LM (gemma4-12b,gpu)
Feels like an inherent attention/saturation limit when audio competes with a long dense text context. (Notably, E4B with a tiny prompt keeps audio attention fine — so I'm using it as a small audio front-end instead.)
Questions for anyone who's tried:
1. Has anyone gotten 12B unified audio to reliably attend to speech with a big system prompt (lots of instructions/tools)?
Known limitation of the unified arch, or a serving/config thing (audio placement in the sequence, attention settings, chat template, sampling)?
Workarounds — audio-first vs audio-last ordering, prompt structuring, attention/RoPE tweaks?
Served on an NVIDIA GB10 (Blackwell).
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.