I Built a Practical Guide to LLM Engineering: RAG, Retrieval, Rerankers, and Evaluation
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
If you’re building LLM apps and feel confused about when to use keyword search, embeddings, rerankers, or vector databases, this repo is for that.
I built a docs-first repo on practical LLM system design patterns. It covers pre-filtering, hybrid retrieval, rerankers, in-memory scoring vs. vector DBs, batching, cleanup, and LLM-as-judge evaluation, with simple Python examples.
In my experience, embedding quality or RAG alone is rarely the full answer. When you're building a real business solution, the engineering harness around the LLM (filtering, retrieval, reranking, evaluation) usually matters just as much as the model itself.
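To make "hybrid retrieval" concrete, here is a minimal sketch of the idea. It is not code from the repo. It assumes a toy in-memory corpus (`DOCS`), uses character-trigram counts as a stand-in for a real embedding model (`embed` is purely illustrative), and merges the keyword and vector rankings with reciprocal rank fusion, which is one common fusion choice among several. All names here are hypothetical.

```python
import math
from collections import Counter

# Toy corpus; in a real system these would be your chunks or documents.
DOCS = {
    "d1": "reset your password from the account settings page",
    "d2": "billing invoices are emailed at the start of each month",
    "d3": "use a reranker to reorder retrieved passages by relevance",
}


def tokenize(text):
    return text.lower().split()


def keyword_score(query, doc):
    """Fraction of query terms that appear in the document."""
    q_terms = set(tokenize(query))
    d_terms = set(tokenize(doc))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def embed(text):
    """Stand-in for an embedding model: character-trigram counts (assumption)."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(scores):
    """Return doc ids ordered from highest to lowest score."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ordered]


def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + position)."""
    fused = Counter()
    for ranking in rankings:
        for pos, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + pos)
    return [doc_id for doc_id, _ in fused.most_common()]


def hybrid_search(query, docs=DOCS, top_k=2):
    """Rank by keywords and by vectors separately, then fuse the two lists."""
    q_vec = embed(query)
    kw_rank = rank({i: keyword_score(query, t) for i, t in docs.items()})
    vec_rank = rank({i: cosine(q_vec, embed(t)) for i, t in docs.items()})
    return rrf([kw_rank, vec_rank])[:top_k]


if __name__ == "__main__":
    print(hybrid_search("how do I reset my password"))
```

In a fuller pipeline, the fused top-k would typically be passed to a reranker before it reaches the LLM. Fusing ranks instead of raw scores avoids having to normalize keyword and vector scores onto the same scale.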
The goal is to make this useful for both newcomers and working developers who want a clearer mental model for building reliable LLM systems.
Repo: https://github.com/SaqlainXoas/llm-system-patterns
I’d love feedback on it. If you find it useful, feel free to star the repo as well. I’d also be interested to hear your own engineering findings around retrieval, embeddings, reranking, RAG, evaluation, and where these approaches work or break in practice.