[Release] Nexidion – A private knowledge vault with an autonomous local AI background worker.
Hello,
After almost two years of on-and-off development, 5 complete architectural rewrites, and hitting a few brick walls, I’m finally open-sourcing a project I built to scratch my own privacy-paranoia itch: Nexidion.
GitHub Repo: https://github.com/HabermannR/Nexidion
There are a lot of "second brain" apps out there, but I didn't want to rely on a third-party cloud, and I definitely didn't want to send my sensitive notes to closed APIs. More importantly, I didn't just want a standard chat window tacked onto a text editor.
The Local LLM Angle: Autonomous Background Worker
Nexidion is a hierarchical Markdown note-taking app with a built-in, optional autonomous background worker designed specifically to plug into local OpenAI-compatible endpoints (llama.cpp, Ollama, LM Studio, etc.).
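For context, "OpenAI-compatible" here just means the standard chat-completions HTTP interface, so the agent can talk to any of those backends the same way. A minimal sketch of such a request (the port and model name are placeholders; use whatever your server reports):

```bash
# A plain OpenAI-style chat completion request to a local backend.
# Port and model name are placeholders -- adjust to your setup.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local-model",
        "messages": [{"role": "user", "content": "Summarize this note."}]
      }'
```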
Instead of just chatting with your notes, you can select a massive batch of nodes/folders and dispatch the agent to do actual work:

* "Reorganize these messy notes into hierarchical folders by topic."
* "Summarize these subtrees."
* "Extract all action items from these meeting notes."
The safety net: Letting an LLM autonomously organize your notes is terrifying if it hallucinates. Because of this, Nexidion has a built-in version control system. The AI works in the background and commits changes as a new version under the AI's name. Every single edit is fully traceable, and if your local model completely botches the organization, you can revert it with one click. No ruined databases. Zero external network calls.
My "GPU Poor" Setup (2080 Ti)
You don't need a massive multi-GPU rig for the agent to be useful. I am GPU poor and running this on a single RTX 2080 Ti (11GB VRAM).
Right now I'm running the brand-new Qwen 3.6 35B-A3B with MTP (specifically the IQ3_XXS quant) on a llama.cpp server backend. It works surprisingly well for the agent tasks!
If anyone with constrained VRAM wants to replicate my setup, here is the exact Docker command I use to squeeze this 35B model onto my 2080 Ti (using flash attention, Q8 KV cache and speculative decoding):
```bash
docker run --gpus all --rm \
  -p 1234:1234 \
  -v /mnt/c/.../models/unsloth/Qwen3.6-35B-A3B:/models \
  havenoammo/llama:cuda12-server \
  -m /models/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  --port 1234 --host 0.0.0.0 \
  -n -1 --parallel 1 --threads 6 \
  --ctx-size 100000 --fit-target 844 \
  --mmap -ngl 18 --flash-attn on \
  --temp 1.0 --min-p 0.0 --top-p 0.95 --top-k 20 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --ubatch-size 512 --batch-size 2048 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type ngram-mod,draft-mtp \
  --spec-draft-n-max 3
```
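Once the container is up, a quick sanity check that the server is listening and the model loaded (these are standard llama.cpp server endpoints, nothing Nexidion-specific):

```bash
# /health reports server readiness; /v1/models lists the loaded model.
curl http://localhost:1234/health
curl http://localhost:1234/v1/models
```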
Getting Started
I just finished Dockerizing everything, so spinning up the Postgres DB, backend, frontend, and the AI task runner takes a single command:
```bash
docker compose --profile with-postgres --profile with-task-runner up -d
```
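To confirm the stack came up cleanly, plain Docker Compose tooling works (the service names come from the repo's compose file):

```bash
# List the running services and follow their logs.
docker compose ps
docker compose logs -f
```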
(Full docs and setup instructions are in the repo).
I’d love to hear your feedback, especially from the local-AI crowd: how does the background agent perform with different models/quants, and what prompts work best for batch organization?
Let me know what you think!