RAG Explained: What It Is and How to Build One (2025)
RAG (Retrieval-Augmented Generation) connects an LLM to your private documents. This guide covers the 4-phase pipeline (embed -> retrieve -> augment -> generate), a RAG vs fine-tuning comparison, vector databases, a full working Python example with LangChain + Chroma, and a Claude swap.
1. What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that connects an LLM to an external knowledge base — letting the model answer questions about your private documents, company data, or any information not in its training data.
The problem RAG solves
LLMs are trained on public data up to a cutoff date. They know nothing about:
- Your company's internal docs, PDFs, Confluence pages
- Information from after their training cutoff
- Proprietary databases, internal policies, customer records
RAG solution: when the user asks a question, first retrieve the relevant chunks from your knowledge base, then pass them to the LLM along with the question.
2. How RAG works — 4 phases
Every RAG pipeline has two stages: ingestion (one-time setup) and query (at runtime). Phase 1 is the ingestion stage; phases 2 to 4 run at query time.
Phase 1 — Ingestion (one-time setup)
- ✓Split documents into chunks (typically 500-1,000 characters each, with overlap)
- ✓Embed each chunk using an embedding model, which converts text to a vector (a list of numbers representing semantic meaning)
- ✓Store the vectors in a vector database alongside the original text
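The chunking step above can be sketched in a few lines. This is a naive fixed-window splitter, not what RecursiveCharacterTextSplitter does internally (that splitter prefers to break on separators such as paragraphs and spaces). The function name `chunk_text` and its defaults are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see: each chunk repeats the
# last character of the previous one.
example_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(example_chunks)
```

Each chunk would then be passed to the embedding model and stored with its vector.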
Phase 2 — Retrieval (at query time)
- ✓Embed the user's question with the same embedding model
- ✓Search the vector database for the most similar chunks (semantic search)
- ✓Return the top-K chunks (typically 3-10) most relevant to the question
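To make the retrieval step concrete, here is a minimal brute-force sketch using cosine similarity. The 3-dimensional vectors and the sample sentences are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database would typically use an approximate nearest-neighbor index rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """store is a list of (chunk_text, vector); returns k (score, text) pairs, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made vectors standing in for real embeddings (illustrative data only).
store = [
    ("Refunds are issued within 30 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Returns require a receipt.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "What is the refund policy?"

results = top_k(query_vec, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```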
Phase 3 — Augmentation
- ✓Construct a prompt: [Retrieved chunks] + “Based on the above context, answer: [user question]”
- ✓The LLM now has the relevant context in its prompt window
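The augmentation step is plain string assembly. A minimal sketch, assuming the prompt wording given above; the helper name `build_prompt` and the numbered `[1]`, `[2]` labels are illustrative additions that make it easier to cite chunks later.

```python
def build_prompt(chunks: list[str], question: str) -> str:
    """Place the retrieved chunks first, then the instruction and the question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"{context}\n\nBased on the above context, answer: {question}"


prompt = build_prompt(
    ["Refunds are issued within 30 days.", "Returns require a receipt."],
    "What is the refund policy?",
)
print(prompt)
```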
Phase 4 — Generation
- ✓LLM generates a response based on the retrieved context
- ✓Response is grounded in your documents, not just training data
Pipeline overview
3. Why RAG vs fine-tuning?
RAG and fine-tuning solve different problems. Here's when to use each:
| | RAG | Fine-tuning |
|---|---|---|
| Data updates | Real-time (add documents) | Requires re-training |
| Cost | Low (API calls + vector DB) | High (GPU compute) |
| Control | See which docs were retrieved | Black box |
| Hallucination risk | Lower (grounded in docs) | Still possible |
| Best for | Private data Q&A, knowledge bases | Changing model style/tone/domain |
4. Vector databases
The vector database is the key infrastructure component. It stores embeddings and supports fast similarity search: given a query vector, it finds the stored vectors that are semantically closest to it.
Chroma — open source, local, zero setup
Best for development and small-scale RAG. Runs entirely in-process — no server, no account required.
pip install chromadb
Pinecone — fully managed, scales to billions of vectors
Free tier (1M vectors). Best for production: no infrastructure to manage, auto-scaling, global replication. pinecone.io
Weaviate — open source + cloud, hybrid search
Good for production hybrid search (semantic + keyword BM25). Self-host or managed cloud. Strong GraphQL API.
Qdrant — open source + cloud, fast Rust performance
Strong performance benchmarks, filterable payloads, good Python SDK.
pgvector — vector search inside PostgreSQL
PostgreSQL extension — add vector search to your existing Postgres database. No new infrastructure if you already run Postgres.
5. Building a simple RAG pipeline — full Python example
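A complete working RAG pipeline with LangChain, OpenAI embeddings, and Chroma. The imports target the LangChain 0.x package layout, where RetrievalQA is a legacy chain, so pin your LangChain version if a newer release no longer ships it under langchain.chains. Set OPENAI_API_KEY in your environment before running.
pip install langchain langchain-community langchain-openai chromadb openai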
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# 1. Load and split documents
loader = TextLoader("your_document.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# 2. Embed and store in Chroma
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 3. Create retriever + QA chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=retriever,
    return_source_documents=True,
)
# 4. Ask questions
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("\nSources:", [doc.metadata for doc in result["source_documents"]])
This loads a text file, splits it into 500-character chunks, embeds them with OpenAI, stores them in Chroma, and creates a question-answering chain that returns source documents alongside the answer.
6. Using Claude instead of GPT
The retrieval layer is LLM-agnostic. Swap one line to use Claude:
from langchain_anthropic import ChatAnthropic
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatAnthropic(model="claude-3-5-sonnet-20241022"),
    retriever=retriever,
    return_source_documents=True,
)
Install the integration with pip install langchain-anthropic and set your ANTHROPIC_API_KEY. The rest of the chain (retriever, Chroma, chunking) stays the same. LangChain also supports Llama (via Ollama), Mistral, and Gemini with the same pattern.
7. RAG best practices
Chunk size matters
500-1,000 characters is a good default. Smaller chunks give more precise retrieval; larger chunks give more context per chunk. Tune based on your document type.
Overlap reduces cut-off issues
A 50-100 character overlap between chunks means a sentence cut at a chunk boundary still appears with some of its surrounding text in the neighboring chunk. Use chunk_overlap=50 in RecursiveCharacterTextSplitter.
Metadata filtering
Store document metadata (filename, date, category) alongside vectors. Filter retrieval by metadata — “only search this year's docs” — to reduce noise and improve precision.
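A minimal sketch of the idea, in plain Python rather than any particular vector database's filter syntax (each database exposes its own). The record layout, the field names `category` and `date`, and the helper `filter_records` are assumptions for illustration.

```python
from datetime import date


def filter_records(records, category=None, not_before=None):
    """Keep records whose metadata matches every filter that was supplied."""
    kept = []
    for record in records:
        meta = record["metadata"]
        if category is not None and meta.get("category") != category:
            continue
        if not_before is not None and meta["date"] < not_before:
            continue
        kept.append(record)
    return kept


# Illustrative records; in a real pipeline each would also carry its vector.
records = [
    {"text": "Travel policy (old)",
     "metadata": {"category": "policy", "date": date(2023, 3, 1)}},
    {"text": "Travel policy (current)",
     "metadata": {"category": "policy", "date": date(2025, 2, 10)}},
    {"text": "Release notes",
     "metadata": {"category": "engineering", "date": date(2025, 4, 2)}},
]

# Narrow the candidate set first, then run similarity search on what remains.
candidates = filter_records(records, category="policy", not_before=date(2025, 1, 1))
print([r["text"] for r in candidates])
```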
Test retrieval quality first
Print the retrieved chunks before the LLM step. If retrieval is wrong, the answer will be wrong — no matter how good the LLM is. Retrieval is the most common failure point.
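Beyond printing chunks by eye, a small labeled test set makes retrieval quality measurable. The sketch below computes a hit rate: the fraction of questions whose expected chunk shows up in the top K. The `retrieve` callable, the document ids, and the canned results are hypothetical stand-ins; in practice you would wrap your own retriever so that it returns ids in ranked order.

```python
def hit_rate_at_k(test_cases, retrieve, k=4):
    """Fraction of questions for which the expected chunk id is in the top k."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_id in test_cases:
        retrieved_ids = retrieve(question)[:k]
        if expected_id in retrieved_ids:
            hits += 1
    return hits / len(test_cases)


def fake_retrieve(question):
    """Stand-in retriever returning canned, ranked ids (illustrative only)."""
    canned = {
        "What is the refund policy?": ["doc-7", "doc-2", "doc-9"],
        "Who approves travel?": ["doc-4", "doc-1", "doc-3"],
    }
    return canned.get(question, [])


test_cases = [
    ("What is the refund policy?", "doc-2"),  # expected chunk is ranked 2nd
    ("Who approves travel?", "doc-8"),        # expected chunk is never retrieved
]
score = hit_rate_at_k(test_cases, fake_retrieve, k=3)
print(f"hit rate @3: {score:.2f}")
```

Tracking this number while changing chunk size, K, or the embedding model shows whether a change helped retrieval before the LLM is involved.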
Use source citations
Show users which documents the answer came from — builds trust and enables verification. LangChain's return_source_documents=True makes this easy.
Hybrid search
Combine semantic search (vectors) + keyword search (BM25) for better recall. Semantic search handles paraphrasing; keyword search catches exact terms. Weaviate and Elasticsearch support this natively.
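One common way to merge the two result lists is reciprocal rank fusion (RRF), sketched below. This is an assumption about how you might combine them, not a description of what Weaviate or Elasticsearch do internally. The constant 60 is a conventional default, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, constant=60):
    """Merge ranked id lists: each list contributes 1 / (constant + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (constant + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_ranking = ["doc-a", "doc-b", "doc-c"]  # from vector search
keyword_ranking = ["doc-c", "doc-a", "doc-d"]   # from BM25 / keyword search

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

RRF only needs ranks, so it avoids having to normalize cosine scores against BM25 scores, which live on different scales.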
8. Common RAG failure modes
Retrieval failure
Correct answer exists in the DB but isn't retrieved. Fix: tune chunk size, add keyword search (hybrid), increase K, review embedding model choice.
Context window overflow
Too many chunks exceed the LLM's context limit. Fix: reduce K, compress chunks with a summarization step, or use a model with a larger context window.
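The "reduce K" fix can be applied dynamically by trimming the ranked chunks to a budget. The sketch below counts characters as a rough proxy for tokens, which is an assumption; an accurate limit would count tokens with the target model's tokenizer. The helper name `fit_to_budget` is illustrative.

```python
def fit_to_budget(ranked_chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed max_chars."""
    selected = []
    used = 0
    for chunk in ranked_chunks:
        if used + len(chunk) > max_chars:
            break  # stop here so lower-ranked chunks never displace higher ones
        selected.append(chunk)
        used += len(chunk)
    return selected


ranked = ["aaaa", "bbbb", "cc"]  # best match first
print(fit_to_budget(ranked, max_chars=9))
```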
Hallucination
LLM adds information not in the retrieved chunks. Fix: explicit system prompt instruction — “Only answer based on the provided context. If the answer isn't in the context, say so.”
Stale data
Documents in the vector DB are outdated while source docs have been updated. Fix: implement document versioning and a re-indexing pipeline — track document checksums, re-embed changed files automatically.
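The checksum tracking described above could look roughly like this. It is a sketch that assumes you keep a mapping of document id to the checksum recorded at indexing time; the function names and file names are hypothetical.

```python
import hashlib


def checksum(text: str) -> str:
    """SHA-256 hex digest of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(current_docs: dict[str, str], indexed_checksums: dict[str, str]):
    """Compare current content with checksums recorded at indexing time.

    Returns (to_reembed, to_delete): ids that are new or changed,
    and ids that are indexed but no longer exist in the source.
    """
    to_reembed = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_checksums.get(doc_id) != checksum(text)
    )
    to_delete = sorted(
        doc_id for doc_id in indexed_checksums if doc_id not in current_docs
    )
    return to_reembed, to_delete


# Illustrative state: what was indexed earlier vs. what the source holds now.
indexed = {
    "faq.txt": checksum("old faq"),
    "policy.txt": checksum("policy v1"),
    "gone.txt": checksum("removed later"),
}
current = {"faq.txt": "new faq", "policy.txt": "policy v1", "new.txt": "hello"}

to_reembed, to_delete = find_stale(current, indexed)
print("re-embed:", to_reembed)
print("delete:", to_delete)
```

Only the changed and new files need to be re-chunked and re-embedded, which keeps re-indexing cheap.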
FAQ
What is RAG in AI?
RAG (Retrieval-Augmented Generation) is a technique that connects an LLM to an external knowledge base. When a user asks a question, the system first retrieves relevant document chunks from a vector database, then passes them to the LLM alongside the question. This lets the LLM answer accurately about your private documents, internal data, or recent information not in its training data.
How is RAG different from fine-tuning?
RAG retrieves external information at query time — documents can be updated without retraining. Fine-tuning bakes new knowledge into the model's weights through training. RAG is better for private data Q&A and real-time information. Fine-tuning is better for changing the model's style, tone, or teaching domain-specific response patterns.
What vector database should I use for RAG?
For development: Chroma (local, zero setup, pip install chromadb). For production scale: Pinecone (fully managed, free 1M vectors) or Weaviate (open source + cloud, hybrid search). If you already use PostgreSQL: pgvector adds vector search to your existing database.
Can I build RAG with Claude or Llama?
Yes. LangChain and LlamaIndex support any LLM including Claude (via langchain-anthropic), Llama (via Ollama), Mistral, Gemini, and GPT-4. Swap the LLM in the chain: ChatAnthropic(model="claude-3-5-sonnet-20241022") instead of ChatOpenAI(). The retrieval layer is LLM-agnostic.