The answer to the ultimate question of Life, the Universe, and Everything is... somewhere in your vector database. Probably. If you indexed it correctly.
— Douglas Adams (adapted)
You've built a Startup Roaster. You've composed a multi-step Lead Intelligence Engine. Your modules are clean, your Pydantic models validate, and you feel like a responsible adult who programs LLMs instead of prompting them.
But there's a problem.
Everything we've built so far relies on one uncomfortable assumption: the LLM already knows what it needs to know. When we asked it to research "Stripe," it drew from its training data — a snapshot of the internet frozen in time, like a fly in amber. What happens when you need answers about your company's docs? Your codebase? Your product knowledge base that was last updated forty-five minutes ago?
You need retrieval. And retrieval, done wrong, is how you end up with an AI that confidently cites a policy document from 2019 while your users politely inform you that everything in that document has been wrong since Q3 2022.
This chapter is about doing it right.
We're going to build something deliciously meta: a system that can answer questions about DSPy's own source code. You'll clone the DSPy repo, index it, and ask it questions like "How does the Embeddings class perform similarity search?" or "What modules are available for retrieval?"
Why a codebase Q&A system? Because the data is real and messy, the questions have verifiable answers, and you can check every citation against the actual source files.
By the end, you'll have a system that loads real files, chunks them intelligently, retrieves the most relevant code, and answers questions with citations — all evaluated with metrics so you know it actually works.
mkdir ch03_retrieval && cd ch03_retrieval
poetry init --name ch03-retrieval --python ">=3.10,<3.15" -n
poetry add "dspy>=3.1.3,<4.0.0" python-dotenv

Your .env needs two keys this time — one for the LLM, one for embeddings:
ANTHROPIC_API_KEY=your-anthropic-key-here
OPENAI_API_KEY=your-openai-key-here
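A missing key fails loudly only at the first API call, which can be minutes into a run. A tiny stdlib check you can run once the keys are loaded into the environment (the variable names match the .env above; the message text is my own):

```python
import os

REQUIRED = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")
# Collect any key that is unset or empty, so we can fail fast with a clear message.
missing = [k for k in REQUIRED if not os.getenv(k)]
print("all keys present" if not missing else f"missing: {', '.join(missing)}")
```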
We're using Anthropic Claude for generation and OpenAI's embedding model for retrieval. Why mix providers? Because text-embedding-3-small is arguably the best price-to-quality embedding model available, and DSPy — via LiteLLM — doesn't care about brand loyalty. Use the best tool for each job.
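In DSPy terms, mixing providers is just two constructor calls. This is a configuration sketch, not a canonical setup: the Claude model id is an example (substitute whatever is current), and the retriever line is shown only as a comment because we wire it up later in the chapter.

```python
import dspy

# Claude handles generation; OpenAI handles embeddings. LiteLLM routes both.
lm = dspy.LM("anthropic/claude-3-5-sonnet-20240620")  # example model id
embedder = dspy.Embedder("openai/text-embedding-3-small")

dspy.configure(lm=lm)
# The embedder is handed to the retriever when we build it, e.g.
# dspy.retrievers.Embeddings(embedder=embedder, corpus=...).
```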
You'll also need a copy of the DSPy source code to index. If you've been following along, you already have it. Otherwise:
git clone https://github.com/stanfordnlp/dspy.git

Our import preamble:
import os
from dotenv import load_dotenv
load_dotenv()
import dspy

Before we can search code, we need to read it. Here's a loader that walks a repository and returns the contents of every source file:
def load_codebase(repo_path, extensions=(".py",), max_file_lines=500):
    """Load source files from a repository directory.

    Returns a list of dicts with 'path', 'content', and 'language'.
    Skips hidden dirs, __pycache__, files over max_file_lines
    (which are usually auto-generated or vendored), and files under
    3 lines (usually empty __init__.py stubs).
    """
    files = []
    for root, dirs, filenames in os.walk(repo_path):
        # Prune in place: os.walk never descends into removed dirs.
        dirs[:] = [d for d in dirs if not d.startswith(('.', '__'))]
        for fname in filenames:
            if not any(fname.endswith(ext) for ext in extensions):
                continue
            filepath = os.path.join(root, fname)
            rel_path = os.path.relpath(filepath, repo_path)
            try:
                with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                    lines = f.readlines()
                if len(lines) > max_file_lines or len(lines) < 3:
                    continue
                content = ''.join(lines)
                files.append({
                    'path': rel_path,
                    'content': content,
                    'language': fname.split('.')[-1],
                })
            except (IOError, UnicodeDecodeError):
                continue
    return files

Let's load it:
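Against the real clone, that's just `load_codebase("dspy")`. If you want to confirm the filtering rules deterministically first, here's a self-contained sketch that replays the same pruning and line-count logic against a throwaway directory tree (stdlib only; the file names are invented for the demo):

```python
import os
import pathlib
import tempfile

# Build a throwaway tree: one real module, plus files the loader should skip.
root = tempfile.mkdtemp()
for rel, body in [
    ("pkg/mod.py", "a = 1\nb = 2\nc = 3\n"),             # kept: 3 lines
    ("pkg/tiny.py", "a = 1\n"),                           # skipped: under 3 lines
    (".git/hooks.py", "a = 1\nb = 2\nc = 3\n"),           # skipped: hidden dir
    ("pkg/__pycache__/mod.py", "a = 1\nb = 2\nc = 3\n"),  # skipped: __ dir
]:
    p = pathlib.Path(root, rel)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(body)

kept = []
for r, dirs, filenames in os.walk(root):
    # Same in-place pruning as load_codebase: os.walk skips removed dirs.
    dirs[:] = [d for d in dirs if not d.startswith(('.', '__'))]
    for fname in filenames:
        if not fname.endswith(".py"):
            continue
        lines = pathlib.Path(r, fname).read_text().splitlines()
        if len(lines) > 500 or len(lines) < 3:
            continue
        kept.append(os.path.relpath(os.path.join(r, fname), root))

print(kept)  # only pkg/mod.py survives the filters
```

Only the one well-formed module makes it through; the hidden directory and cache directory are never even visited, thanks to the in-place `dirs[:]` assignment.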