Retrieval-augmented generation (RAG) is the single most useful pattern in applied AI: it grounds a model in your data so it answers from facts, not vibes. Here is how to build one that survives real users, not just a demo.
The pipeline
- Ingest — pull documents from your sources.
- Chunk — split with structure in mind, not fixed character counts.
- Embed — turn chunks into vectors.
- Store — a vector database (pgvector works great) with good metadata.
- Retrieve — vector search for the top-k relevant chunks.
- Rerank — a second pass that keeps only the best few.
- Ground — build a prompt from the retrieved context.
- Answer with citations — always return sources.
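To make the chunking step concrete, here is a minimal sketch of structure-aware splitting. It assumes the documents are Markdown-like text where headings start with "#" and are separated from paragraphs by blank lines. The names Chunk, chunk_by_structure, and max_chars are illustrative and not part of any library.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text, kept as metadata
    text: str


def chunk_by_structure(doc: str, max_chars: int = 800) -> list[Chunk]:
    """Split on headings first, then pack whole paragraphs up to max_chars."""
    chunks: list[Chunk] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs as one chunk under the current heading.
        if buffer:
            chunks.append(Chunk(heading, "\n\n".join(buffer)))
            buffer.clear()

    for block in re.split(r"\n\s*\n", doc.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # a new section never shares a chunk with the previous one
            heading = block.lstrip("#").strip()
            continue
        size = sum(len(p) for p in buffer) + len(block)
        if buffer and size > max_chars:
            flush()  # keep paragraphs whole instead of cutting mid-sentence
        buffer.append(block)
    flush()
    return chunks
```

A paragraph longer than max_chars still becomes its own oversized chunk here. A real pipeline would probably split those further, for example on sentence boundaries.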
A sketch
def answer(question: str) -> Answer:
    docs = retrieve(question, k=8)  # vector search
    docs = rerank(question, docs)[:4]  # keep the best
    context = format_context(docs)
    reply = llm.complete(SYSTEM, question, context)
    return Answer(text=reply, sources=[d.id for d in docs])
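The sketch above leaves format_context undefined. One possible version, assuming each retrieved document exposes an id and a text field, tags every chunk with its source id so the model has something to cite. The Doc class, the build_prompt helper, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    text: str


def format_context(docs: list[Doc]) -> str:
    """Render each chunk under a [source: id] tag so the model can cite it."""
    return "\n\n".join(f"[source: {d.id}]\n{d.text.strip()}" for d in docs)


def build_prompt(question: str, docs: list[Doc]) -> str:
    """Combine grounding instructions, tagged context, and the question."""
    context = format_context(docs)
    return (
        "Answer using only the context below. "
        "Cite the source id for each claim, and say so if the context "
        "does not contain the answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Keeping the ids in the prompt means the ids returned in sources match what the model actually saw.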
What makes it "production"
A demo stops at step 8, the cited answer at the end of the pipeline above. A production system adds:
- Evals — a test set + error analysis so you can prove quality.
- Observability — trace the question, retrieved context, tokens, and latency.
- Controls — retries, timeouts, caching, and a cost budget.
- Safety — input validation and prompt-injection defense.
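As one illustration of the controls item, the sketch below covers two of the four pieces: retries with exponential backoff and a small in-memory cache. Timeouts and a cost budget are left out. The names with_retries and AnswerCache are made up for this example, and a real system would likely retry only on transient errors and use a shared cache.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")


class AnswerCache:
    """In-memory cache keyed on the normalized question text."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_compute(self, question: str, compute: Callable[[str], str]) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key.
        key = " ".join(question.lower().split())
        if key not in self._store:
            self._store[key] = compute(question)
        return self._store[key]
```

The sleep parameter exists so tests can pass a fake and avoid real waiting.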
See production-ready GenAI architecture for the full layer list.
Debugging RAG
When answers are wrong, check retrieval first. Log the retrieved context and ask: was the right chunk even fetched? If it was not, the model never saw the fact, so fix chunking and reranking before you touch the prompt or model. This is also a favorite interview question.
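A small harness can turn "was the right chunk even fetched?" into a repeatable check. The sketch below assumes you have a hand-labeled list of (question, expected chunk id) pairs and a retrieve_ids function that returns the ids of the top-k chunks. Both are hypothetical stand-ins for your own test set and retriever.

```python
from typing import Callable

Case = tuple[str, str]  # (question, id of the chunk that holds the answer)
Retriever = Callable[[str, int], list[str]]


def retrieval_misses(
    cases: list[Case], retrieve_ids: Retriever, k: int = 8
) -> list[tuple[str, str, list[str]]]:
    """Return (question, expected_id, fetched_ids) for every case that missed."""
    misses = []
    for question, expected_id in cases:
        fetched = retrieve_ids(question, k)
        if expected_id not in fetched:
            misses.append((question, expected_id, fetched))
    return misses


def hit_rate(cases: list[Case], retrieve_ids: Retriever, k: int = 8) -> float:
    """Fraction of cases whose expected chunk appears in the top-k results."""
    if not cases:
        return 0.0
    return 1 - len(retrieval_misses(cases, retrieve_ids, k)) / len(cases)
```

Reading the misses one by one usually shows whether the problem is chunk boundaries, the embedding query, or a value of k that is too small.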
Next
This is project one on the roadmap. Build it, put it on GitHub, and use it as your portfolio centerpiece — see 5 AI projects that get you hired.