Voice-to-voice chatbot update
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| I've been working on this after hours for a few months continuously improving it. Now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B (Unsloth's UD-Q3_K_XL), Whisper-small STT, and Orpheus Q4_K_XL TTS with a custom SNAC decoder on ONNX. VRAM usage holds at 21.3 GB or less leaving decent headroom for compute graphs on a 24 GB GPU. System RAM MoE experts for Qwen occupy about ~150 GB. This is running with bf16 KV cache (Qwen3.5 spazzes out with Q8 KV), at 131,072 tokens. Enough for hours of conversation. GitHub code coming soon - should be able to upload this evening after I'm done with the honey-do list. [link] [comments] |
More from r/LocalLLaMA
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
-
Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.