r/LocalLLaMA · June 26, 2026 · 1 min read

Getting real work out of a 4B local model: the distill-on-idle pipeline behind an on-device "memory" assistant





https://preview.redd.it/iiiqwt96tn9h1.png?width=3004&format=png&auto=webp&s=f02fba9f64e27ac91b2ae4cd478842106b294366

https://preview.redd.it/47cb5u96tn9h1.png?width=3024&format=png&auto=webp&s=b1cee93477970b8b0a636c37be657fecd38ba968

https://preview.redd.it/t45iv1a6tn9h1.png?width=3018&format=png&auto=webp&s=beef94ac59848eb1d61fbf9ac25c3d201201d47a

Posting the engineering, because "local AI assistant" usually means "wrapper around an API" and this crowd will (rightly) call that out.

The problem: turn raw screen capture + meeting transcripts into something queryable, using only models that run comfortably on a laptop, without melting the battery or stealing the GPU from whatever you're actually doing.

What ended up working:

- OCR is not the LLM's job. Apple's Vision framework does on-device OCR; the LLM never burns tokens reading pixels. Huge win on both speed and accuracy.
- Distillation runs on idle, in batches. A 4B-class model (Gemma) summarizes capture into per-project notes when the machine isn't busy. Foreground stays snappy because the heavy lifting waits for slack time.
- Retrieval is hybrid, not pure-vector. SQLite FTS for exact/lexical + LanceDB for semantic, fused. Pure vector search kept missing exact identifiers (ticket numbers, error strings); FTS alone missed paraphrase. Together they're solid.
- Small models are fine when the context is tight. The trick isn't a bigger model, it's giving a small one a small, relevant, well-retrieved slice. Most "the local model is dumb" failures I hit were retrieval failures wearing a costume.
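For anyone who wants something concrete to poke at on the fusion side, here is a minimal sketch of one common way to fuse a lexical ranking with a semantic one: weighted reciprocal rank fusion. This is an assumption for illustration, not necessarily what the project does. The function name `fuse_rankings`, the weights, and the constant `k=60` are all hypothetical, and the inputs are plain ranked lists of note IDs standing in for whatever the SQLite FTS and LanceDB queries return.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fuse_rankings(
    lexical_ids: Sequence[str],
    semantic_ids: Sequence[str],
    w_lexical: float = 1.0,
    w_semantic: float = 1.0,
    k: int = 60,
) -> List[str]:
    """Weighted reciprocal rank fusion over two ranked ID lists (best first).

    Each list contributes weight / (k + rank) per ID, so an ID that shows up
    in both lists accumulates score from both. Assumes each list has no
    duplicate IDs. All names and defaults here are illustrative.
    """
    scores: Dict[str, float] = defaultdict(float)
    for weight, ranking in ((w_lexical, lexical_ids), (w_semantic, semantic_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; tie-break on the ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical results: FTS found an exact ticket match, vectors found paraphrases.
    lexical = ["note-ticket", "note-standup"]
    semantic = ["note-retro", "note-ticket"]
    print(fuse_rankings(lexical, semantic))
    print(fuse_rankings(lexical, semantic, w_lexical=3.0))
```

Rank-based fusion sidesteps the fact that FTS scores and vector distances live on different scales. Raising `w_lexical` is one knob for making exact identifiers (ticket numbers, error strings) win over loose paraphrase matches.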

Honest limitations: macOS only today, and tuned for Apple Silicon (leans hard on ScreenCaptureKit + the Neural Engine). Intel Macs work, but OCR + inference are noticeably slower. Diarization quality on overlapping speech is still meh.

The whole thing is AGPL. I'm interested in how others here are handling on-idle scheduling and the FTS+vector fusion weighting. Link in comments to keep it clean.

Code: https://github.com/off-grid-ai/desktop. Build from source. Happy to get into the scheduler internals or the retrieval fusion if anyone wants to compare notes.
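As a starting point for comparing notes on the scheduler, here is a minimal sketch of the distill-on-idle idea: captures queue up, and a periodic tick drains at most one batch when an idle check passes. Everything here is hypothetical (`IdleBatcher`, `is_idle`, `process_batch`, the batch size) and is not taken from the repo. The idle signal is injected as a callable, so in practice it could be backed by CPU load, power state, or user input activity.

```python
from collections import deque
from typing import Callable, Deque, List


class IdleBatcher:
    """Queues work items and processes them in batches only while idle.

    `is_idle` and `process_batch` are injected, so the idle signal and the
    summarization call (e.g. a local 4B model) stay outside this class.
    """

    def __init__(
        self,
        is_idle: Callable[[], bool],
        process_batch: Callable[[List[str]], None],
        batch_size: int = 8,
    ) -> None:
        self._is_idle = is_idle
        self._process_batch = process_batch
        self._batch_size = batch_size
        self._pending: Deque[str] = deque()

    def enqueue(self, item: str) -> None:
        """Record a capture for later distillation; does no heavy work."""
        self._pending.append(item)

    def pending(self) -> int:
        return len(self._pending)

    def tick(self) -> int:
        """Call periodically. Processes at most one batch, and only if idle.

        Returns the number of items handed to `process_batch`.
        """
        if not self._pending or not self._is_idle():
            return 0
        count = min(self._batch_size, len(self._pending))
        batch = [self._pending.popleft() for _ in range(count)]
        self._process_batch(batch)
        return len(batch)
```

Processing one batch per tick means the idle check runs again between batches, so the work backs off quickly once the machine gets busy.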

submitted by /u/alichherawalla
