I built an open-source Knowledge Graph pipeline with hybrid retrieval to improve LLM multi-hop reasoning [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hey everyone,
I built an open-source full-stack pipeline (Django + React) that constructs a Knowledge Graph from raw text, detects thematic communities, and uses hybrid search to solve the "lost in the middle" problem in standard vector retrieval.
The Pipeline:
- Ingestion & Chunking: Raw text is cleaned, parsed, and split into overlapping chunks to preserve local context.
- Graph Construction: spaCy extracts named entities from each chunk. A weighted co-occurrence graph is built using NetworkX, mapping which entities appear together and linking them to their source chunks.
- Community Detection: The graph is partitioned into thematic clusters using greedy_modularity_communities. For each cluster, random text chunks are sampled and sent to an LLM to generate a high-level summary (preventing "hub node" bias).
- Indexing: All chunks are embedded into a dense vector store, and a sparse BM25 index is built over the same corpus.
- Hybrid Retrieval: On query, the system performs a dual search (Dense Vector + BM25). Simultaneously, it extracts entities from the prompt, traverses the graph for 1st-degree neighbors, and retrieves their associated chunks.
- Fusion & Reranking: Local and Global (community summary) results are merged, deduplicated, and scored using Reciprocal Rank Fusion (RRF). The top-K candidates are then re-scored by a Cross-Encoder for maximum precision.
- LLM Synthesis: The final curated context is passed to the LLM with strict prompting to generate a concise, well-structured, and cited answer.
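To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over several ranked lists of chunk IDs. It is not taken from the repo: the function name, the chunk IDs, and the constant k=60 (the value commonly used for RRF) are assumptions for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs into a single deduplicated ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not the repo's value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID so the output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical result lists from the three retrieval paths.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
graph_hits = ["c1", "c7"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits, graph_hits])
print(fused)
```

Because RRF only looks at ranks, it sidesteps the problem that dense similarity scores and BM25 scores live on different scales. The fused list would then go to the Cross-Encoder for re-scoring.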
Why it works:
Standard vector search fails at multi-hop queries like:
Who ordered the execution of Sansa's father, and how did that person eventually die?
By traversing the graph (Sansa -> Ned -> Joffrey -> Poisoning), the system bridges the gap between disconnected text chunks and synthesizes the correct answer.
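The traversal idea can be sketched with a toy co-occurrence graph. This is a pure-Python stand-in, not the project's code: the pipeline uses spaCy and NetworkX, while the chunk IDs, entity sets, function names, and the `hops` parameter below are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: chunk ID -> entities found in that chunk
# (a stand-in for the output of the NER step).
chunk_entities = {
    "c1": {"Sansa", "Ned"},
    "c2": {"Ned", "Joffrey"},
    "c3": {"Joffrey", "Poisoning"},
    "c4": {"Arya", "Needle"},
}


def build_cooccurrence(chunk_entities):
    """Build edge weights, adjacency sets, and an entity -> chunks map."""
    weights = defaultdict(int)        # sorted entity pair -> co-occurrence count
    neighbors = defaultdict(set)      # entity -> entities it co-occurs with
    entity_chunks = defaultdict(set)  # entity -> chunks that mention it
    for chunk_id, entities in chunk_entities.items():
        for entity in entities:
            entity_chunks[entity].add(chunk_id)
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
            neighbors[a].add(b)
            neighbors[b].add(a)
    return weights, neighbors, entity_chunks


def expand_chunks(query_entities, neighbors, entity_chunks, hops=1):
    """Collect chunks for the query entities and everything within `hops` edges."""
    frontier = set(query_entities)
    seen = set(frontier)
    for _ in range(hops):
        frontier = {n for e in frontier for n in neighbors.get(e, ())} - seen
        seen |= frontier
    return sorted({c for e in seen for c in entity_chunks.get(e, ())})


weights, neighbors, entity_chunks = build_cooccurrence(chunk_entities)
one_hop = expand_chunks({"Sansa"}, neighbors, entity_chunks, hops=1)
two_hop = expand_chunks({"Sansa"}, neighbors, entity_chunks, hops=2)
print(one_hop, two_hop)
```

In this toy data, a first-degree expansion from Sansa reaches Ned and pulls in the chunk that also mentions Joffrey. The chunk about the poisoning only appears with `hops=2` here, so in a real run it would need to arrive through a wider traversal or through the dense and BM25 paths.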
GitHub: https://github.com/mohammad-majoony/graphrag-studio
Would love feedback! Thanks.