r/LocalLLaMA · June 20, 2026 · 2 min read

Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)


Hey folks,

I’m designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:

Storage

Upload PDF, DOCX, XLSX, CSV, tables
All data stored locally (no cloud)

Document Ingestion

Watch folder (e.g., Watchdog) → auto‑ingest on file add/modify/delete
Nested folder structure → auto‑tagging
Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG
Version control on re‑upload
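To make the auto-tagging and versioning items concrete, here is a minimal sketch using only the standard library. It assumes tags are simply the folder names between the watch root and the file, and that a "version" is a change in the file's SHA-256 hash. The function names and the in-memory `index` dict are hypothetical; a watcher such as watchdog would call something like this from its file-event handler.

```python
import hashlib
from pathlib import Path


def tags_from_path(root: Path, file_path: Path) -> list[str]:
    """Use each folder between the watch root and the file as a tag."""
    return list(file_path.relative_to(root).parent.parts)


def content_hash(data: bytes) -> str:
    """Identify a file version by the SHA-256 of its bytes."""
    return hashlib.sha256(data).hexdigest()


def register_version(index: dict, rel_path: str, data: bytes) -> int:
    """Record a new version only when the content actually changed.

    `index` maps a relative path to the list of hashes seen so far
    (an assumption for this sketch; a real system would persist it).
    Returns the 1-based version number of the current content.
    """
    digest = content_hash(data)
    versions = index.setdefault(rel_path, [])
    if not versions or versions[-1] != digest:
        versions.append(digest)
    return len(versions)
```

Hashing the bytes rather than trusting the modification time means a re-upload of an identical file does not create a spurious new version.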

Query & Retrieval

Restrict queries to a single client’s documents (no cross‑client leakage)
Structured queries (e.g., “Show invoices > ₹1 lakh”)
Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
Keyword fallback
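As an illustration of the structured-query and client-isolation items, here is a rough sketch that assumes invoice fields have already been extracted into a relational table. The table layout, column names, and sample rows are all made up for the example; SQLite stands in for whatever store is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices "
    "(client_id TEXT, doc_id TEXT, page INTEGER, amount_inr REAL)"
)
# Hypothetical sample rows, only for demonstration.
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("client_a", "inv_001.pdf", 1, 250000),
        ("client_a", "inv_002.pdf", 1, 40000),
        ("client_b", "inv_901.pdf", 2, 900000),
    ],
)

ONE_LAKH = 100_000  # 1 lakh = 100,000


def invoices_above(conn, client_id, min_amount):
    """Return (doc_id, page, amount) rows for one client only."""
    rows = conn.execute(
        "SELECT doc_id, page, amount_inr FROM invoices "
        "WHERE client_id = ? AND amount_inr > ? "
        "ORDER BY amount_inr DESC",
        (client_id, min_amount),
    )
    return rows.fetchall()


result = invoices_above(conn, "client_a", ONE_LAKH)
```

Putting the `client_id` filter inside the query itself, rather than filtering after retrieval, is one way to keep another client's rows from ever reaching the LLM prompt.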

Highlighting & Rendering

Annotated PDF served to frontend
XLSX → colored cell export
Jump directly to highlighted page
Multi‑document highlights in one response

Answer Generation

Local LLM only
Every claim cited with doc + page reference

My Questions

Parsing: I’m considering LlamaIndex LiteParse.
→ Should I store document IDs + chunk IDs for PDFs to enable highlighting?
Vector DB:
- Do I need one (e.g., Qdrant)?
- If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?
- Would pgvector in Postgres be sufficient?
GraphRAG:
- How effective are systems like Neo4j or Microsoft GraphRAG?
- Can they run locally/offline, or are they too computationally heavy?
- Is this GraphRAG pipeline a good starting point?
Highlighting UX:
- I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.
- Any open‑source projects that already do this?
- I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.
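On the question of storing doc IDs + chunk IDs alongside embeddings, here is a minimal in-memory sketch of the kind of record that seems to be needed. It assumes the parser can supply a page number and a bounding box per chunk; the `Chunk` fields, the toy brute-force search, and the returned dict shape are all hypothetical. In Qdrant the same fields would go into each point's payload, and with pgvector they would be ordinary columns next to the vector column.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    chunk_id: int
    client_id: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) on the page, assumed to come from the parser
    text: str
    embedding: list


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_vec, client_id, top_k=3):
    """Rank one client's chunks by similarity, keeping the citation data."""
    scoped = [c for c in chunks if c.client_id == client_id]
    ranked = sorted(
        scoped, key=lambda c: cosine(c.embedding, query_vec), reverse=True
    )
    return [
        {"doc_id": c.doc_id, "chunk_id": c.chunk_id, "page": c.page, "bbox": c.bbox}
        for c in ranked[:top_k]
    ]
```

Each hit carries enough to cite the source (doc + page) and to draw a highlight (bbox) without re-parsing the document at answer time.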

TL;DR

Trying to build a local RAG system with:

Storage + ingestion + tagging
Query + retrieval + highlighting
Local LLM answer generation with citations

Looking for advice on:

Vector DB vs pgvector
GraphRAG feasibility offline
Best way to implement document highlighting + citation preview

Would love to hear from anyone who’s built something similar or explored these tools.

submitted by /u/PravalPattam12945RPG