RAG Explained: What It Is and How to Build One (2025)

RAG (Retrieval-Augmented Generation) connects an LLM to your private documents. This guide covers the 4-phase pipeline (embed → retrieve → augment → generate), a RAG vs fine-tuning comparison, vector databases, a full working Python example with LangChain and Chroma, and how to swap in Claude.

1. What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that connects an LLM to an external knowledge base — letting the model answer questions about your private documents, company data, or any information not in its training data.

The problem RAG solves

LLMs are trained on public data up to a cutoff date. They know nothing about:

- Your company's internal docs, PDFs, Confluence pages
- Information from after their training cutoff
- Proprietary databases, internal policies, customer records

RAG solution: when the user asks a question, first retrieve the relevant chunks from your knowledge base, then pass them to the LLM along with the question.

Simple analogy: Asking a chatbot about your internal docs is like asking someone to answer a question about a book they've never read. RAG hands them the relevant pages first.

2. How RAG works — 4 phases

Every RAG pipeline has two stages: ingestion (one-time setup) and query (at runtime). Ingestion is Phase 1; the query stage covers Phases 2 through 4.

Phase 1 — Ingestion (one-time setup)

✓ Split documents into chunks (typically 500–1,000 characters each, with overlap)
✓ Embed each chunk using an embedding model → converts text to a vector (a list of numbers representing semantic meaning)
✓ Store the vectors in a vector database alongside the original text
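
As a rough illustration of the chunking step, here is a minimal sketch that cuts text into fixed-size character windows with overlap. It is a simplification: the function name chunk_text is hypothetical, and a real splitter such as RecursiveCharacterTextSplitter also tries to break on paragraph and sentence separators rather than at arbitrary character positions.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Hypothetical helper for illustration; it ignores sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 1,200-character sample string, used only to show the window arithmetic.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

Each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.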

Phase 2 — Retrieval (at query time)

✓ Embed the user's question with the same embedding model
✓ Search the vector database for the most similar chunks (semantic search)
✓ Return the top-K chunks (typically 3–10) most relevant to the question
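
The retrieval step can be sketched without any vector database at all. The example below assumes cosine similarity as the distance measure and uses hand-made 3-dimensional vectors in place of real embeddings; the function names and the toy data are illustrative only. A vector database does the same ranking, but with indexes that avoid comparing the query against every stored vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3):
    """Brute-force search: score every stored chunk, return the k best as (score, text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (assumption: made up by hand).
store = [
    ("Refunds are issued within 30 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Returns require the original receipt.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "What is the refund policy?"

results = top_k(query_vec, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```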

Phase 3 — Augmentation

✓ Construct a prompt: [Retrieved chunks] + “Based on the above context, answer: [user question]”
✓ The LLM now has the relevant context in its prompt window
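
A minimal sketch of the augmentation step follows. The function name build_prompt and the [1], [2] numbering of chunks are assumptions added for illustration; the instruction sentence is the one quoted above.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks first, then the instruction and the question."""
    # Numbering the chunks is an illustrative choice that makes them easy to cite.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return f"{context}\n\nBased on the above context, answer: {question}"


prompt = build_prompt(
    "What is the refund policy?",
    ["Refunds are issued within 30 days.", "Returns require the original receipt."],
)
print(prompt)
```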

Phase 4 — Generation

✓ LLM generates a response based on the retrieved context
✓ Response is grounded in your documents, not just training data

Pipeline overview

User question → Embed question → Vector search → Top-K chunks → Prompt with context → LLM → Answer

3. Why RAG vs fine-tuning?

RAG and fine-tuning solve different problems. Here's when to use each:

Aspect	RAG	Fine-tuning
Data updates	Real-time (add documents)	Requires re-training
Cost	Low (API calls + vector DB)	High (GPU compute)
Control	See which docs were retrieved	Black box
Hallucination risk	Lower (grounded in docs)	Still possible
Best for	Private data Q&A, knowledge bases	Changing model style/tone/domain

Verdict: Use RAG when you want the LLM to answer questions about your data. Use fine-tuning when you want to change how the model responds (style, domain expertise, following specific formats).

4. Vector databases

The vector database is the key infrastructure component. It stores embeddings and supports fast similarity search: given a query vector, it finds the stored vectors that are semantically closest to it.

Dev

Chroma — open source, local, zero setup

Best for development and small-scale RAG. Runs entirely in-process — no server, no account required.

pip install chromadb

Prod

Pinecone — fully managed, scales to billions of vectors

Free tier (1M vectors). Best for production: no infrastructure to manage, auto-scaling, global replication. pinecone.io

Prod

Weaviate — open source + cloud, hybrid search

Good for production hybrid search (semantic + keyword BM25). Self-host or managed cloud. Strong GraphQL API.

Alt

Qdrant — open source + cloud, fast Rust performance

Strong performance benchmarks, filterable payloads, good Python SDK.

Alt

pgvector — vector search inside PostgreSQL

PostgreSQL extension — add vector search to your existing Postgres database. No new infrastructure if you already run Postgres.

Rule of thumb: For development — Chroma (local, zero infra). For production — Pinecone or Weaviate.

5. Building a simple RAG pipeline — full Python example

A complete working RAG pipeline with LangChain, OpenAI embeddings, and Chroma.

Install dependencies

pip install langchain langchain-community langchain-openai chromadb openai

Complete RAG pipeline — Python

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1. Load and split documents
loader = TextLoader("your_document.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 2. Embed and store in Chroma
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Create retriever + QA chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=retriever,
    return_source_documents=True
)

# 4. Ask questions
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("\nSources:", [doc.metadata for doc in result["source_documents"]])

This loads a text file, splits it into chunks of up to 500 characters, embeds them with OpenAI, stores them in Chroma, and creates a question-answering chain that returns source documents alongside the answer. Set the OPENAI_API_KEY environment variable before running it.

6. Using Claude instead of GPT

The retrieval layer is LLM-agnostic. Change the llm argument (plus one import) to use Claude:

Swap to Claude — Python

from langchain_anthropic import ChatAnthropic

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatAnthropic(model="claude-3-5-sonnet-20241022"),
    retriever=retriever,
    return_source_documents=True
)

Run pip install langchain-anthropic and set the ANTHROPIC_API_KEY environment variable. The rest of the chain (retriever, Chroma, chunking) stays the same. The embeddings in this example still come from OpenAI, so OPENAI_API_KEY is still required. LangChain also supports Llama (via Ollama), Mistral, and Gemini with the same pattern.

7. RAG best practices

✓ Chunk size matters

500–1,000 characters is a good default. Smaller chunks give more precise retrieval; larger chunks carry more context per chunk. Tune based on your document type.

✓ Overlap reduces cut-off issues

A 50–100 character overlap between consecutive chunks means text cut at a chunk boundary usually still appears whole in one of the two neighbouring chunks. Set chunk_overlap=50 in RecursiveCharacterTextSplitter.

✓ Metadata filtering

Store document metadata (filename, date, category) alongside vectors. Filter retrieval by metadata (for example, “only search this year's docs”) to reduce noise and improve precision.
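
The idea can be sketched as a pre-filter that runs before similarity ranking. Everything here is hypothetical: the record layout, the field names, and the filter_by_metadata helper are illustrations, and real vector databases expose their own filter syntax.

```python
from datetime import date

# Hypothetical chunk records: text plus the metadata stored alongside it.
chunks = [
    {"text": "Refunds within 14 days.",
     "metadata": {"filename": "policy_2023.txt", "date": date(2023, 3, 1), "category": "policy"}},
    {"text": "Refunds within 30 days.",
     "metadata": {"filename": "policy_2025.txt", "date": date(2025, 2, 1), "category": "policy"}},
    {"text": "Team offsite agenda.",
     "metadata": {"filename": "offsite_2025.txt", "date": date(2025, 5, 9), "category": "events"}},
]


def filter_by_metadata(chunks, *, category=None, min_date=None):
    """Keep only chunks whose metadata matches every filter that was supplied."""
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if category is not None and meta["category"] != category:
            continue
        if min_date is not None and meta["date"] < min_date:
            continue
        selected.append(chunk)
    return selected


# "Only search policy docs dated 2025 or later"; ranking then runs on this subset.
candidates = filter_by_metadata(chunks, category="policy", min_date=date(2025, 1, 1))
print([c["metadata"]["filename"] for c in candidates])
```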

✓ Test retrieval quality first

Print the retrieved chunks before the LLM step. If retrieval is wrong, the answer will be wrong, no matter how good the LLM is. Retrieval is the most common failure point.
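
One way to make this check repeatable is a small hit-rate test: for each question, record which source document should be retrieved and see whether it shows up in the top K. The sketch below uses a hypothetical keyword-overlap retriever and a three-document corpus as stand-ins; in practice you would pass in a function that wraps your own retriever.

```python
# Tiny stand-in corpus of (source, text) pairs, made up for illustration.
CORPUS = [
    ("refunds.txt", "Refunds are issued within 30 days of purchase."),
    ("shipping.txt", "Orders ship within two business days."),
    ("privacy.txt", "We never sell customer data."),
]


def keyword_retrieve(question: str, k: int):
    """Hypothetical retriever: rank chunks by the number of shared lowercase words."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(retrieve, test_cases, k: int = 4) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    hits = 0
    for question, expected_source in test_cases:
        sources = [source for source, _text in retrieve(question, k)]
        print(f"{question!r} -> {sources}")  # inspect retrieval before any LLM call
        if expected_source in sources:
            hits += 1
    return hits / len(test_cases)


test_cases = [
    ("are refunds issued within 30 days", "refunds.txt"),
    ("when do orders ship", "shipping.txt"),
    ("what is the warranty period", "warranty.txt"),  # not in the corpus: a miss
]
rate = hit_rate(keyword_retrieve, test_cases, k=1)
print(f"hit rate: {rate:.2f}")
```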

✓ Use source citations

Show users which documents the answer came from. This builds trust and enables verification. LangChain's return_source_documents=True makes this easy.

✓ Hybrid search

Combine semantic search (vectors) + keyword search (BM25) for better recall. Semantic search handles paraphrasing; keyword search catches exact terms. Weaviate and Elasticsearch support this natively.
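
The article does not say how the two result lists should be merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as what any particular database does internally. The document ids and both rankings are made up, and rrf_k=60 is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (rrf_k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a vector search and a BM25 keyword search.
semantic_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one method still makes it into the fused list.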

8. Common RAG failure modes

⚠ Retrieval failure

The correct answer exists in the DB but isn't retrieved. Fix: tune chunk size, add keyword search (hybrid), increase K, or review the embedding model choice.

⚠ Context window overflow

Too many chunks exceed the LLM's context limit. Fix: reduce K, compress chunks with a summarization step, or use a model with a larger context window.
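
A simple guard is to cap how much retrieved text goes into the prompt. The sketch below measures the budget in characters purely as a simplifying assumption; real context limits are counted in tokens, so a production version would count with the model's tokenizer. The helper name fit_to_budget is hypothetical.

```python
def fit_to_budget(chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed max_chars."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += len(chunk)
    return selected


# Three 400-character chunks, best match first; only two fit in 1,000 characters.
ranked_chunks = ["a" * 400, "b" * 400, "c" * 400]
kept = fit_to_budget(ranked_chunks, max_chars=1000)
print(len(kept), sum(len(c) for c in kept))
```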

⚠ Hallucination

The LLM adds information not in the retrieved chunks. Fix: add an explicit system prompt instruction such as “Only answer based on the provided context. If the answer isn't in the context, say so.”

⚠ Stale data

Documents in the vector DB are outdated while source docs have been updated. Fix: implement document versioning and a re-indexing pipeline that tracks document checksums and re-embeds changed files automatically.
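
Checksum tracking can be sketched with the standard library alone. The example assumes you keep a mapping from document id to the SHA-256 of its content at the last indexing run; the function names and sample documents are hypothetical, and the actual re-embedding step is left out.

```python
import hashlib


def checksum(text: str) -> str:
    """SHA-256 hex digest of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_changed(documents: dict[str, str], index: dict[str, str]) -> list[str]:
    """Ids that are new or whose content differs from the stored checksum."""
    return [doc_id for doc_id, text in documents.items()
            if index.get(doc_id) != checksum(text)]


def find_deleted(documents: dict[str, str], index: dict[str, str]) -> list[str]:
    """Ids that were indexed before but no longer exist in the source."""
    return [doc_id for doc_id in index if doc_id not in documents]


# Checksums recorded at the last indexing run (hypothetical state).
index = {
    "a.txt": checksum("refund window: 14 days"),
    "b.txt": checksum("shipping: 2 days"),
    "old.txt": checksum("retired policy"),
}
# Current source documents.
documents = {
    "a.txt": "refund window: 30 days",  # edited since last run
    "b.txt": "shipping: 2 days",        # unchanged
    "c.txt": "new warranty terms",      # newly added
}

to_reembed = find_changed(documents, index)
to_remove = find_deleted(documents, index)
print("re-embed:", to_reembed)
print("remove:", to_remove)
```

Only the changed and new documents need to be re-chunked and re-embedded; deleted ones should have their vectors removed so they stop appearing in results.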

FAQ

What is RAG in AI?

RAG (Retrieval-Augmented Generation) is a technique that connects an LLM to an external knowledge base. When a user asks a question, the system first retrieves relevant document chunks from a vector database, then passes them to the LLM alongside the question. This lets the LLM answer accurately about your private documents, internal data, or recent information not in its training data.

How is RAG different from fine-tuning?

RAG retrieves external information at query time — documents can be updated without retraining. Fine-tuning bakes new knowledge into the model's weights through training. RAG is better for private data Q&A and real-time information. Fine-tuning is better for changing the model's style, tone, or teaching domain-specific response patterns.

What vector database should I use for RAG?

For development: Chroma (local, zero setup, pip install chromadb). For production scale: Pinecone (fully managed, free 1M vectors) or Weaviate (open source + cloud, hybrid search). If you already use PostgreSQL: pgvector adds vector search to your existing database.

Can I build RAG with Claude or Llama?

Yes. LangChain and LlamaIndex support any LLM including Claude (via langchain-anthropic), Llama (via Ollama), Mistral, Gemini, and GPT-4. Swap the LLM in the chain: ChatAnthropic(model="claude-3-5-sonnet-20241022") instead of ChatOpenAI(). The retrieval layer is LLM-agnostic.
