r/LocalLLaMA

Watching a local AI voice assistant get dumber (A 9B to 0.8B agent experiment on my RTX 5060 Ti)

I wanted to find the exact floor for running an intelligent, local voice assistant agent on consumer hardware.

Keeping the environment, tools, and prompts identical, I stepped the model size down through Qwen 3.5 9B, 4B, 2B, and 0.8B to see how agentic reasoning degrades.

The results were a fascinating, slow-motion lobotomy.

While response speed definitely improved as the parameter count shrank, the capability drop-off was massive:

  • 9B (The Current Default): Trustworthy and handles tool orchestration really well, but takes its time. This is the biggest model I could run at a decent quantization level on my RTX 5060 Ti with 16GB VRAM.
  • 4B (The Floor): Faster, but experiences a noticeable loss of grounding. It starts getting lazy, skipping tool calls to confidently guess facts instead.
  • 2B (Semantic Drift): Loses conversational context entirely. It suffers from severe semantic blur, mixing up similarly shaped concepts in its latent space (like drifting from soccer to completely different sports leagues in my queries).
  • 0.8B (Total Mechanical Failure): Completely incapable of operating agent machinery. It triggers the wrong APIs entirely or gets caught in infinite failure loops.
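
For anyone who wants to run a similar sweep, here is a minimal sketch of a harness that scores the tool-level failures in the list above (skipped tool calls, wrong APIs, failure loops) and times each model size. It is not the setup used in this post: the model tags, the `Turn` record, and the `run_agent` callback are all hypothetical placeholders for whatever local runtime you use. It also does not detect semantic drift, which would need a check on the answer content itself.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical size tags; the post does not say which runtime or model tags were used.
MODEL_SIZES = ["9b", "4b", "2b", "0.8b"]


@dataclass
class Turn:
    """What the agent did for one prompt (assumed shape, adapt to your runtime)."""
    tool_called: Optional[str]  # name of the tool the model invoked, or None
    repeats: int = 1            # how many times it re-issued the same call


def classify(turn: Turn, expected_tool: str, max_repeats: int = 3) -> str:
    """Label one turn with a tool-level failure mode."""
    if turn.repeats > max_repeats:
        return "loop"           # stuck re-issuing the same call
    if turn.tool_called is None:
        return "skipped_tool"   # answered from memory instead of calling a tool
    if turn.tool_called != expected_tool:
        return "wrong_tool"     # triggered a different API than expected
    return "ok"


def run_sweep(run_agent: Callable[[str, str], Turn], cases):
    """Run every (prompt, expected_tool) case against each model size.

    The same cases are used for every size, so only the model changes.
    """
    results = {}
    for size in MODEL_SIZES:
        start = time.perf_counter()
        labels = [classify(run_agent(size, prompt), expected)
                  for prompt, expected in cases]
        results[size] = {
            "labels": labels,
            "seconds": time.perf_counter() - start,
        }
    return results
```

To use it, pass a `run_agent(size, prompt)` function that sends the prompt to the chosen model and returns a `Turn` describing which tool it called. Comparing the label counts and timings across sizes gives a rough view of the speed versus reliability trade-off.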

I'm curious what capabilities the bigger models would open up on a voice assistant AI agent...

submitted by /u/liampetti