Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hey folks,
I’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:
Storage
- Upload PDF, DOCX, XLSX, CSV, tables
- All data stored locally (no cloud)
Document Ingestion
- Watch folder (e.g., Watchdog) → auto‑ingest on file add/modify/delete
- Nested folder structure → auto‑tagging
- Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG
- Version control on re‑upload
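To make the tagging and versioning bullets concrete, here is a rough sketch using only the standard library. The names `tags_from_path` and `VersionRegistry` are hypothetical, and it assumes tags are simply the folder names between the watch root and the file, with a new version recorded only when the file's bytes change. The folder watcher itself (e.g., Watchdog) is left out.

```python
import hashlib
from pathlib import PurePosixPath


def tags_from_path(root: str, file_path: str) -> list[str]:
    """Use every folder between the watch root and the file as a tag."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    return list(rel.parts[:-1])  # drop the file name itself


class VersionRegistry:
    """In-memory version history per path (hypothetical helper).

    A new version is recorded only when the content hash differs from
    the latest stored one, so re-saving an identical file is a no-op.
    """

    def __init__(self) -> None:
        self._history: dict[str, list[str]] = {}

    def register(self, file_path: str, data: bytes) -> int:
        digest = hashlib.sha256(data).hexdigest()
        history = self._history.setdefault(file_path, [])
        if not history or history[-1] != digest:
            history.append(digest)
        return len(history)  # 1-based version number of the current content
```

A file-system event handler would call `register` on each add/modify event and re-ingest only when the returned version number goes up.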
Query & Retrieval
- Restrict queries to a single client’s documents (no cross‑client leakage)
- Structured queries (e.g., “Show invoices > ₹1 lakh”)
- Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
- Keyword fallback
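One way to picture the client restriction, the structured query, and the keyword fallback together is a plain SQL table where every query carries a `client_id` filter. This is only a sketch with SQLite from the standard library; the `invoices` table, its columns, and both function names are assumptions made for illustration, and ₹1 lakh is taken as 100,000.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table of extracted invoice rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        "client_id TEXT, doc_id TEXT, page INTEGER, amount_inr REAL, text TEXT)"
    )
    return conn


def invoices_above(conn: sqlite3.Connection, client_id: str, min_amount: float):
    """Structured query: invoices over a threshold, scoped to one client."""
    return conn.execute(
        "SELECT doc_id, page, amount_inr FROM invoices "
        "WHERE client_id = ? AND amount_inr > ? ORDER BY amount_inr DESC",
        (client_id, min_amount),
    ).fetchall()


def keyword_fallback(conn: sqlite3.Connection, client_id: str, keyword: str):
    """Keyword fallback: substring match, still scoped to one client.

    Note: % and _ inside `keyword` act as LIKE wildcards in this sketch.
    """
    return conn.execute(
        "SELECT doc_id, page FROM invoices "
        "WHERE client_id = ? AND text LIKE ? ORDER BY doc_id, page",
        (client_id, f"%{keyword}%"),
    ).fetchall()
```

Because the `client_id` condition lives inside every query function rather than in the caller, a forgotten filter cannot leak another client's rows.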
Highlighting & Rendering
- Annotated PDF served to frontend
- XLSX → colored cell export
- Jump directly to highlighted page
- Multi‑document highlights in one response
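For the highlighting step, the core problem is mapping a quoted claim back to an exact span in the page text. Below is a minimal sketch of that mapping, assuming the page text has already been extracted and that a naive punctuation-based sentence split is good enough; `split_sentences` and `best_sentence` are hypothetical names. Drawing the highlight onto the PDF or colouring the XLSX cell would still need a document library and is not shown.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, sentence) spans using a naive punctuation split.

    Caveat: this also splits on decimal points and abbreviations.
    """
    spans = []
    for m in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = m.group()
        sentence = raw.strip()
        if sentence:
            start = m.start() + (len(raw) - len(raw.lstrip()))
            spans.append((start, start + len(sentence), sentence))
    return spans


def best_sentence(page_text: str, quote: str):
    """Find the sentence most similar to `quote`.

    Returns (start, end, score) as character offsets into page_text,
    or None when the page has no sentences.
    """
    best = None
    for start, end, sentence in split_sentences(page_text):
        score = SequenceMatcher(None, sentence.lower(), quote.lower()).ratio()
        if best is None or score > best[2]:
            best = (start, end, score)
    return best
```

The returned offsets, together with the document ID and page number, are what a frontend would need to jump to the page and highlight the sentence; running this per cited document gives multi-document highlights in one response.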
Answer Generation
- Local LLM only
- Every claim cited with doc + page reference
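The "every claim cited" requirement can be checked after generation rather than trusted to the model. The sketch below assumes a citation format of `[doc_id p.N]` placed before each sentence's final punctuation; that format, the chunk dictionary keys, and the function names are all assumptions for illustration, not an existing tool's convention.

```python
import re

# Matches citations such as "[fy24.pdf p.3]" (assumed format).
CITATION = re.compile(r"\[([^\[\]]+?) p\.(\d+)\]")


def build_context(chunks: list[dict]) -> str:
    """Label each retrieved chunk so the LLM can copy the label as a citation."""
    return "\n".join(f"[{c['doc_id']} p.{c['page']}] {c['text']}" for c in chunks)


def citation_problems(answer: str, chunks: list[dict]) -> list[tuple[str, str]]:
    """Flag sentences with no citation or with a citation that was never retrieved."""
    allowed = {(c["doc_id"], c["page"]) for c in chunks}
    problems = []
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cites = [(doc, int(page)) for doc, page in CITATION.findall(sentence)]
        if not cites:
            problems.append(("missing", sentence))
        elif any(cite not in allowed for cite in cites):
            problems.append(("unknown", sentence))
    return problems
```

An empty result means every sentence points at a chunk that was actually retrieved; it does not prove the chunk supports the claim, which would need a separate check.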
My Questions
- Parsing: I’m considering LlamaIndex LiteParse.
  - Should I store document IDs + chunk IDs for PDFs to enable highlighting?
- Vector DB:
  - Do I need one (e.g., Qdrant)?
  - If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?
  - Would pgvector in Postgres be sufficient?
- GraphRAGs:
  - How effective are systems like Neo4j or Microsoft GraphRAG?
  - Can they run locally/offline, or are they too computationally heavy?
  - Is this GraphRAG pipeline a good starting point?
- Highlighting UX:
  - I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.
  - Any open‑source projects that already do this?
  - I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.
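To illustrate the question about storing doc IDs + chunk IDs alongside embeddings, here is a toy in-memory version of the idea: each embedding travels with the metadata needed for highlighting, and search returns the whole record. The `ChunkRecord` fields and the brute-force `search` are assumptions for illustration. In Qdrant the same fields would sit in a point's payload, and with pgvector they would be ordinary columns next to the vector column.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    """One chunk: its embedding plus everything needed to highlight it later."""
    doc_id: str
    chunk_id: int
    client_id: str
    page: int
    char_start: int  # offset of the chunk within the page text
    char_end: int
    embedding: tuple[float, ...]


def cosine(a, b) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vec, client_id: str, top_k: int = 3):
    """Brute-force nearest neighbours, restricted to one client's chunks."""
    scoped = [r for r in records if r.client_id == client_id]
    scoped.sort(key=lambda r: cosine(r.embedding, query_vec), reverse=True)
    return scoped[:top_k]
```

Since each hit already carries `doc_id`, `page`, and character offsets, the highlighting step does not need a second lookup.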
TL;DR
Trying to build a local RAG system with:
- Storage + ingestion + tagging
- Query + retrieval + highlighting
- Local LLM answer generation with citations
Looking for advice on:
- Dedicated vector DB (e.g., Qdrant) vs pgvector in Postgres
- GraphRAG feasibility offline
- Best way to implement document highlighting + citation preview
Would love to hear from anyone who’s built something similar or explored these tools.