This resource demystifies Retrieval Augmented Generation (RAG), the technique that marries information retrieval with large language models for grounded, up-to-date answers. You’ll learn the core architecture, build a working RAG prototype in Python, avoid common pitfalls, and see how tools like Galaxy can power SQL-backed RAG workflows.
Large language models (LLMs) are powerful, but they have two glaring weaknesses: their knowledge is frozen at training time, and when they lack the relevant facts they tend to hallucinate confident but incorrect answers.
Retrieval Augmented Generation (RAG) solves both problems by injecting fresh, factual context into the model at inference time. Instead of relying solely on the model’s parameters, RAG retrieves relevant information from an external knowledge base (KB) and feeds that text—often called context—to the generator. The model grounds its response in that context and can cite sources, boosting accuracy and trust.
The canonical choice is a vector database (Pinecone, Weaviate, FAISS, etc.). Each document or chunk is embedded into a high-dimensional vector using an embedding model such as text-embedding-3-small or all-MiniLM-L6-v2. For SQL datasets, you can precompute embeddings for column descriptions or query results and store them alongside IDs.
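As a minimal sketch of that SQL-metadata pattern (the table and column names below are invented for illustration), you might embed column descriptions and keep a parallel list of IDs:
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

# Hypothetical column metadata pulled from your warehouse's information schema.
columns = [
    {"id": "orders.order_total", "description": "Total order value in USD after discounts"},
    {"id": "users.signup_date",  "description": "Date the customer created their account"},
    {"id": "events.churn_flag",  "description": "1 if the customer cancelled during the period"},
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Normalized vectors so inner-product search behaves like cosine similarity.
vecs = embedder.encode([c["description"] for c in columns], normalize_embeddings=True)

ids = [c["id"] for c in columns]            # parallel list: FAISS row -> column ID
col_index = faiss.IndexFlatIP(vecs.shape[1])
col_index.add(np.asarray(vecs, dtype="float32"))

q = embedder.encode(["how much revenue did we lose to churn?"], normalize_embeddings=True)
_, hits = col_index.search(np.asarray(q, dtype="float32"), 2)
print([ids[i] for i in hits[0]])            # the two most relevant column IDs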
Given a user query, the retriever converts it into an embedding and performs nearest-neighbor search against the vector store, returning the k most similar chunks.
An LLM (GPT-4o, Claude 3, Llama 3, etc.) receives the retrieved chunks as system or user context, then generates the answer. This step is sometimes called sequence generation.
Some stacks include a separate cross-encoder model to score passages more precisely or to reorder retrieved chunks before final generation.
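If you want to experiment with this, sentence-transformers ships a CrossEncoder class; here is a minimal re-ranking sketch (the model name is one of the public MS MARCO cross-encoders, and rerank is a hypothetical helper, not part of the tutorial below):
from sentence_transformers import CrossEncoder

# Pretrained cross-encoder fine-tuned for passage relevance (MS MARCO).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, passages, top_n=3):
    # Score each (query, passage) pair jointly -- slower than bi-encoder
    # similarity, but more precise -- then keep the best top_n passages.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]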
The following 12-minute exercise walks you through a minimal, end-to-end RAG loop. We’ll use transformers + sentence-transformers for embeddings, FAISS (faiss-cpu) for vector search, and the OpenAI API for generation.
pip install transformers sentence-transformers faiss-cpu openai tiktoken python-dotenv
Step 1. Load and Chunk Documents
import glob, textwrap
from pathlib import Path

def chunk_text(text, chunk_size=400):
    # Split the text into ~400-character passages without breaking words.
    wrapper = textwrap.TextWrapper(width=chunk_size, break_long_words=False)
    return wrapper.wrap(text)

files = glob.glob("./docs/*.txt")
corpus = []
for f in files:
    txt = Path(f).read_text()
    corpus.extend(chunk_text(txt))

print(f"Loaded & chunked {len(corpus)} passages")
Step 2. Embed All Passages
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Normalize the vectors so inner-product search (IndexFlatIP) is equivalent
# to cosine similarity.
embeddings = embedder.encode(corpus, show_progress_bar=True, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.array(embeddings).astype("float32"))
Step 3. Build the Retrieval Function
def retrieve(query, k=4):
    # Embed the query the same way as the corpus (normalized), then return
    # the k nearest passages by inner-product (cosine) similarity.
    q_emb = embedder.encode([query], normalize_embeddings=True)
    D, I = index.search(np.array(q_emb).astype("float32"), k)
    return [corpus[i] for i in I[0]]
Step 4. Generate Answer with Retrieved Context
import os
from openai import OpenAI

# The OpenAI v1 client; it reads OPENAI_API_KEY from the environment.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def rag_answer(question):
    # Retrieve the most relevant passages and pack them into the prompt.
    context = retrieve(question)
    prompt = (
        "You are a helpful assistant. Use ONLY the context below to answer the question.\n" +
        "\n--- CONTEXT ---\n" + "\n".join(context) +
        "\n--- QUESTION ---\n" + question +
        "\n--- ANSWER ---\n"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()
print(rag_answer("What were the key findings in the 2023 retention report?"))
Voilà! You now have a basic RAG system. Swap FAISS for a managed vector DB, add citation formatting, and you can go to production.
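Citation formatting, for example, can start with carrying each chunk’s source file through retrieval and numbering sources in the prompt. A rough sketch, assuming you build corpus as (source, passage) tuples instead of bare strings:
# Assumes corpus entries are (source_path, passage) tuples.
def retrieve_with_sources(query, k=4):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    _, I = index.search(np.array(q_emb).astype("float32"), k)
    return [corpus[i] for i in I[0]]          # [(source, passage), ...]

def format_context(hits):
    # Number each passage so the model can cite "[1]", "[2]", ... in its answer,
    # and keep a legend mapping those numbers back to source files.
    body = "\n".join(f"[{n}] {p}" for n, (_, p) in enumerate(hits, 1))
    legend = "\n".join(f"[{n}] {src}" for n, (src, _) in enumerate(hits, 1))
    return body, legend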
Try extending the example: add a cross-encoder re-ranking pass before generation, switch to the token-aware chunker sketched above, or ground retrieval in your SQL warehouse, as the next section shows.
If your organization’s source of truth is SQL, Galaxy becomes a powerful bridge:
Example: ask "How many paying customers churned last week and what are the top 3 reasons?" The retriever pulls the validated “churn” SQL from Galaxy, executes it, joins with a sentiment table, and returns a numeric table and the raw JSON. The generator then crafts a narrative answer citing the query link hosted in Galaxy.
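The exact wiring depends on your stack; the sketch below is a generic illustration of that flow, not Galaxy’s API. It uses a hard-coded stand-in for the validated churn query, executes it with SQLite, and reuses the OpenAI client from Step 4; the churn_events table and its columns are invented for the example.
import json, sqlite3

# Stand-in for an endorsed "churn" query -- in the workflow above this would be
# fetched from your team's validated query collection rather than hard-coded.
CHURN_SQL = """
SELECT reason, COUNT(*) AS customers
FROM churn_events
WHERE churned_at >= DATE('now', '-7 days')
GROUP BY reason
ORDER BY customers DESC
LIMIT 3
"""

def run_churn_query(conn):
    # Execute the validated SQL and return rows as a list of dicts.
    conn.row_factory = sqlite3.Row
    return [dict(r) for r in conn.execute(CHURN_SQL)]

def sql_rag_answer(question, conn):
    rows = run_churn_query(conn)
    prompt = (
        "Answer the question using ONLY the query results below.\n" +
        "--- RESULTS (JSON) ---\n" + json.dumps(rows, default=str) +
        "\n--- QUESTION ---\n" + question +
        "\n--- ANSWER ---\n"
    )
    # Reuses the OpenAI client defined in Step 4.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()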
Finally, evaluate your pipeline as you iterate: RAG evaluation frameworks (e.g., ragas, TruLens) can track retrieval relevance and answer faithfulness over time.
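Before adopting a full framework, even a crude LLM-as-judge spot check, reusing retrieve, rag_answer, and the client from the tutorial, can catch obvious hallucinations; the judging prompt below is ad hoc:
def faithfulness_spot_check(question):
    # Crude check: does the generated answer stay within the retrieved context?
    # Frameworks like ragas or TruLens do this far more rigorously.
    context = retrieve(question)
    answer = rag_answer(question)
    judge_prompt = (
        "Context:\n" + "\n".join(context) +
        "\n\nAnswer:\n" + answer +
        "\n\nDoes the answer contain claims not supported by the context? "
        "Reply with SUPPORTED or UNSUPPORTED and one sentence of justification."
    )
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    ).choices[0].message.content
    return answer, verdict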