Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with voiceflow.com (the chatbot SaaS; name collision, sorry).

Why this exists: I wanted local-only dictation and meeting transcription, because audio shouldn't have to leave the machine just to become text. I had a 6GB GPU sitting there doing nothing most of the day. So I built it: hold a hotkey, faster-whisper transcribes locally, text pastes at the cursor.

v1.6.0 shipped today and adds the meetings recorder: mic + system audio into one stereo file, transcribed locally, with the summary going through whatever endpoint you point it at (Ollama, llama.cpp, Groq, OpenAI). The only network call in the whole product is the optional summary, and you pick where it goes.

The on-topic part for this sub: mini models on real workloads. v1.6.0 was the excuse to actually benchmark this on real meeting transcripts instead of toy prompts.

I tried the latest small Qwen first, qwen3.5:0.8b (873M, Q8_0).

Test rig: RTX 3060 Laptop 6GB, ~4.3GB free after Whisper loads, Ollama 0.23, Arch. Input: a real 4-minute meeting, ~2900 chars.

It works, with one caveat. Ollama's VRAM-aware default num_ctx on this GPU is 4096, and on a reasoning model with thinking on by default, that budget gets eaten by thinking tokens before the user-visible tokens land. One-line Modelfile fix:

    FROM qwen3.5:0.8b
    PARAMETER num_ctx 16384

After that it streamed a 1562-char structured summary in 57 seconds at 2.2GB of VRAM. TL;DR, decisions, action items, open questions, all there. Better than I'd expect from sub-1B, honestly.

For the "but you didn't go small enough" counter: I sanity-checked Granite 4.0 350M on the same workload. Speed-wise it crushed (0.6 to 2.8 seconds per summary vs 57s for the 0.8B Qwen) and structure came back clean, sections all in the right places. Then I read the output. On a transcript about Anthropic acquiring Bun, Granite returned "Anthropic's acquisition by Anthropic" and invented Binance as a discussion topic. A different 4-minute meeting came back as a Star Trek bridge log ("Starship Cassiopeia", "Tao City F", colony vessel Andromeda). Keywords matched, relationships scrambled.

So qwen3.5:0.8b-vf is the working floor for me. I haven't seen anything coherent come out of sub-500M on real conversation data yet; open to being shown wrong.

For people who don't want to run local: Groq's free tier on llama-3.3-70b has been solid. ~2 seconds per summary, output is tighter than the local 0.8B, and the only thing that broke it for me was a 4-hour meeting transcript that blew past their context window. For anything under that, it's a real free option.

The actual question I'd like answers on, since this is the sub that knows: long-context structured summarization on low VRAM. The 0.8B Qwen handles a 4-minute meeting comfortably at 16K context. For 1-2 hour transcripts (~30K-60K tokens) on a 6-8GB GPU, what's working? Pushing context wider and eating the VRAM, chunked map-reduce, or a different small model that doesn't fall apart on long inputs? Looking for something that holds structure (TL;DR + sections + bullets) when the input gets long, without needing 24GB of VRAM to do it.

App: one .exe on Windows, one .AppImage on Linux. Pyloid + React + faster-whisper + SQLite, CUDA auto-detect with CPU fallback. Model + mic + hotkey done in onboarding in about a minute. Claude was the pair-programming assistant for a lot of boilerplate and the Qt threading gnarliness; architecture and the hard bugs are mine, and the git history is honest about it.

Repo + 1.6.0:
https://github.com/infiniV/VoiceFlow
https://github.com/infiniV/VoiceFlow/releases/tag/v1.6.0

Web: https://get-voice-flow.vercel.app/

Mostly want to hear answers. Star if it works for you, but a bug report in Issues is more useful.
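For reference, the "chunked map-reduce" option raised in the long-context question could look roughly like the sketch below. It is not VoiceFlow's code and has not been benchmarked on the rig above. The function names (`ollama_summarize`, `chunk_transcript`, `map_reduce_summarize`), the prompt wording, and the 12000-char chunk size are all hypothetical choices for illustration. The only real API used is Ollama's `/api/generate` endpoint with a per-request `options.num_ctx`, which has the same effect as the Modelfile parameter without creating a derived model; the URL assumes Ollama's default local port.

```python
import json
import urllib.request
from typing import Callable, List

# Assumption: Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_summarize(text: str, model: str = "qwen3.5:0.8b", num_ctx: int = 16384) -> str:
    """Summarize one piece of text, overriding num_ctx for this request only."""
    payload = {
        "model": model,
        # Hypothetical prompt wording, not the app's actual prompt.
        "prompt": "Summarize this meeting transcript as TL;DR, decisions, "
                  "action items, open questions:\n\n" + text,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def chunk_transcript(text: str, max_chars: int = 12000) -> List[str]:
    """Split on line boundaries so that no chunk exceeds max_chars."""
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the budget is hard-split.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would overflow the current one.
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 12000,
) -> str:
    """Map: summarize each chunk. Reduce: summarize the joined partial summaries."""
    chunks = chunk_transcript(text, max_chars)
    if len(chunks) <= 1:
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunks]
    # Single reduce pass; assumes the joined partials fit in one context window.
    return summarize("\n\n".join(partials))
```

Usage would be something like `map_reduce_summarize(transcript, ollama_summarize)`. The summarizer is passed in as a callable so the same chunking works against any of the endpoints mentioned above. VRAM stays bounded by the per-chunk context rather than the full transcript, at the cost of one model call per chunk plus one for the reduce step, and cross-chunk relationships may get lost in the partial summaries.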