The answer to the ultimate question of Life, the Universe, and Everything is... somewhere in your vector database. Probably. If you indexed it correctly.
— Douglas Adams (adapted)
You've built a Startup Roaster. You've composed a multi-step Lead Intelligence Engine. Your modules are clean, your Pydantic models validate, and you feel like a responsible adult who programs LLMs instead of prompting them.
But there's a problem.
Everything we've built so far relies on one uncomfortable assumption: the LLM already knows what it needs to know. When we asked it to research "Stripe," it drew from its training data — a snapshot of the internet frozen in time, like a fly in amber. What happens when you need answers about your company's docs? Your codebase? Your product knowledge base that was last updated forty-five minutes ago?
You need retrieval. And retrieval, done wrong, is how you end up with an AI that confidently cites a policy document from 2019 while your users politely inform you that everything in that document has been wrong since Q3 2022.
This chapter is about doing it right.
We're going to build something deliciously meta: a system that can answer questions about DSPy's own source code. You'll clone the DSPy repo, index it, and ask it questions like "How does the Embeddings class perform similarity search?" or "What modules are available for retrieval?"
Why a codebase Q&A system? Because the data is real and messy, the questions have verifiable answers, and you can check every citation against the actual source files.
By the end, you'll have a system that loads real files, chunks them intelligently, retrieves the most relevant code, and answers questions with citations — all evaluated with metrics so you know it actually works.
mkdir ch03_retrieval && cd ch03_retrieval
poetry init --name ch03-retrieval --python ">=3.10,<3.15" -n
poetry add "dspy>=3.1.3,<4.0.0" python-dotenv

Your .env needs two keys this time — one for the LLM, one for embeddings:
ANTHROPIC_API_KEY=your-anthropic-key-here
OPENAI_API_KEY=your-openai-key-here
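A missing key fails loudly only at the first API call, which can be minutes into a run. A tiny stdlib check you can run once the keys are loaded into the environment (the variable names match the .env above; the message text is my own):

```python
import os

REQUIRED = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")
# Collect any key that is unset or empty, so we can fail fast with a clear message.
missing = [k for k in REQUIRED if not os.getenv(k)]
print("all keys present" if not missing else f"missing: {', '.join(missing)}")
```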
We're using Anthropic Claude for generation and OpenAI's embedding model for retrieval. Why mix providers? Because text-embedding-3-small is arguably the best price-to-quality embedding model available, and DSPy — via LiteLLM — doesn't care about brand loyalty. Use the best tool for each job.
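In DSPy terms, mixing providers is just two constructor calls. This is a configuration sketch, not a canonical setup: the Claude model id is an example (substitute whatever is current), and the retriever line is shown only as a comment because we wire it up later in the chapter.

```python
import dspy

# Claude handles generation; OpenAI handles embeddings. LiteLLM routes both.
lm = dspy.LM("anthropic/claude-3-5-sonnet-20240620")  # example model id
embedder = dspy.Embedder("openai/text-embedding-3-small")

dspy.configure(lm=lm)
# The embedder is handed to the retriever when we build it, e.g.
# dspy.retrievers.Embeddings(embedder=embedder, corpus=...).
```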
You'll also need a copy of the DSPy source code to index. If you've been following along, you already have it. Otherwise:
git clone https://github.com/stanfordnlp/dspy.git

Our import preamble:
import os
from dotenv import load_dotenv
load_dotenv()
import dspy

Before we can search code, we need to read it. Here's a loader that walks a repository and returns the contents of every source file:
def load_codebase(repo_path, extensions=(".py",), max_file_lines=500):
    """Load source files from a repository directory.

    Returns a list of dicts with 'path', 'content', and 'language'.
    Skips hidden dirs, __pycache__, files over max_file_lines
    (which are usually auto-generated or vendored), and files under
    3 lines (usually empty __init__.py stubs).
    """
    files = []
    for root, dirs, filenames in os.walk(repo_path):
        # Prune in place: os.walk never descends into removed dirs.
        dirs[:] = [d for d in dirs if not d.startswith(('.', '__'))]
        for fname in filenames:
            if not any(fname.endswith(ext) for ext in extensions):
                continue
            filepath = os.path.join(root, fname)
            rel_path = os.path.relpath(filepath, repo_path)
            try:
                with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                    lines = f.readlines()
                if len(lines) > max_file_lines or len(lines) < 3:
                    continue
                content = ''.join(lines)
                files.append({
                    'path': rel_path,
                    'content': content,
                    'language': fname.split('.')[-1],
                })
            except (IOError, UnicodeDecodeError):
                continue
    return files

Let's load it:
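Against the real clone, that's just `load_codebase("dspy")`. If you want to confirm the filtering rules deterministically first, here's a self-contained sketch that replays the same pruning and line-count logic against a throwaway directory tree (stdlib only; the file names are invented for the demo):

```python
import os
import pathlib
import tempfile

# Build a throwaway tree: one real module, plus files the loader should skip.
root = tempfile.mkdtemp()
for rel, body in [
    ("pkg/mod.py", "a = 1\nb = 2\nc = 3\n"),             # kept: 3 lines
    ("pkg/tiny.py", "a = 1\n"),                           # skipped: under 3 lines
    (".git/hooks.py", "a = 1\nb = 2\nc = 3\n"),           # skipped: hidden dir
    ("pkg/__pycache__/mod.py", "a = 1\nb = 2\nc = 3\n"),  # skipped: __ dir
]:
    p = pathlib.Path(root, rel)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(body)

kept = []
for r, dirs, filenames in os.walk(root):
    # Same in-place pruning as load_codebase: os.walk skips removed dirs.
    dirs[:] = [d for d in dirs if not d.startswith(('.', '__'))]
    for fname in filenames:
        if not fname.endswith(".py"):
            continue
        lines = pathlib.Path(r, fname).read_text().splitlines()
        if len(lines) > 500 or len(lines) < 3:
            continue
        kept.append(os.path.relpath(os.path.join(r, fname), root))

print(kept)  # only pkg/mod.py survives the filters
```

Only the one well-formed module makes it through; the hidden directory and cache directory are never even visited, thanks to the in-place `dirs[:]` assignment.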