Harmless DSPy
Chapter 4

The Babel Fish — Optimizers Demystified

60 min read


The Babel Fish is small, yellow, and leech-like... it feeds on brainwave energy. DSPy optimizers are similar, except they feed on your training data and make your programs dramatically smarter.

— Douglas Adams (adapted)

You've built programs. You've composed pipelines. You've wired up retrieval. Everything works, and if you squint, you might even call it "production-ready."

But here's the uncomfortable truth: every DSPy program you've written so far is running on vibes.

When you wrote dspy.ChainOfThought("context, question -> answer"), DSPy generated a reasonable prompt and sent it to Claude. Claude, being Claude, gave a reasonable answer. But "reasonable" isn't "optimized." The prompt DSPy generated wasn't tuned for your specific task. The examples it showed the LM (if any) weren't selected for maximum impact. The instructions weren't refined through experimentation.

You've been driving a Ferrari in first gear.

This chapter is where we shift up. DSPy's optimizers — officially called teleprompters, because naming things is hard — are the framework's superpower. They take your program, your data, and your metric, and they automatically figure out how to make the LM do a better job. No manual prompt tweaking. No "please try harder" appended to your system message. Just algorithms that search for better ways to instruct the model.

And the best part? Your code doesn't change. Same Signatures. Same Modules. Same forward() method. The optimizer adjusts the parameters — the instructions and demonstrations that DSPy injects into the prompt — while your program structure stays clean.


The Mental Model: Programs Have Parameters

Before we touch any optimizer, you need to understand what they're actually optimizing.

Every dspy.Predict (or ChainOfThought, or any prediction module) inside your program has hidden parameters:

  1. Instructions — the text that tells the LM what to do
  2. Demonstrations — few-shot examples shown to the LM before your actual input

When you write dspy.ChainOfThought("context, question -> answer"), the instruction is auto-generated from your field names and descriptions, and there are zero demonstrations. That's the "unoptimized" baseline.
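To make those two parameters concrete, here is a conceptual sketch. This is an illustration of the search space, not DSPy's internal data structure — the instruction text shown is an assumption about what an auto-generated prompt roughly looks like:

```python
from dataclasses import dataclass, field

# Illustration only: the two "hidden parameters" an optimizer can tune
# for each prediction module inside your program.
@dataclass
class PredictorParams:
    instructions: str                           # text telling the LM what to do
    demos: list = field(default_factory=list)   # few-shot examples shown first

# The unoptimized baseline: auto-generated instruction, zero demonstrations.
baseline = PredictorParams(
    instructions="Given the fields `context`, `question`, produce the field `answer`.",
    demos=[],
)
```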

An optimizer's job is to find better values for these parameters. LabeledFewShot stuffs in examples from your training data. BootstrapFewShot generates synthetic examples by running your program and keeping the successes. MIPROv2 rewrites the instructions themselves using an LM. They're all searching the same space — they just use different strategies.

Think of it like this:

Your code (fixed):

  - Signature fields — what goes in, what comes out
  - Module.forward() — how the steps connect
  - Metric function — what "good" means

The optimizer's job (variable):

  - Instructions — "You are an expert code analyst who..."
  - Demonstrations — [Example 1, Example 2, Example 3]

Your code defines the what. The optimizer finds the how.



What You Need: Data and a Metric

Every optimizer needs two things:

1. A training set — examples of inputs (and optionally expected outputs) that represent your task. In DSPy, this is a list of dspy.Example objects.

2. A metric function — a callable that takes (example, prediction, trace=None) and returns a score. This is how the optimizer knows whether its changes made things better or worse.

The quality of your metric determines the quality of your optimization. A vague metric gives vague improvements. A precise metric gives precise improvements. We'll get very specific about metric design in this chapter.

🧪 Lab Notes: In production, I spend more time on the metric function than on the module code. The module is usually straightforward — "retrieve context, generate answer." The metric is where the domain knowledge lives. What counts as a "good" answer for your specific use case? That question deserves serious thought.


Warming Up: Optimizing the Codebase Explorer

Before we build something new, let's see what optimizers can do for a program we already understand. Remember our CodebaseExplorer from Chapter 3? Let's optimize it.

"""
optimize_explorer.py — Optimizing the Chapter 3 Codebase Explorer
"""

import os
import dspy
from dotenv import load_dotenv

load_dotenv()

# Reuse everything from Chapter 3
from codebase_qa import (
    load_codebase, chunk_code_files, get_embedder,
    CodebaseExplorer, answer_quality_metric, CODEBASE_QA_DATASET,
)

# Setup
lm = dspy.LM("anthropic/claude-sonnet-4-6", temperature=0.7, max_tokens=1500)
dspy.configure(lm=lm)

# Build the retriever (reuse from Ch3)
repo_path = os.path.join(os.path.dirname(__file__), "..", "ch03_retrieval",
                         "..", "..", "dspy", "dspy")
files = load_codebase(repo_path)
chunks = chunk_code_files(files)
embedder = get_embedder()
retriever = dspy.Embeddings(embedder=embedder, corpus=chunks, k=5)

# Unoptimized baseline
explorer = CodebaseExplorer(retriever=retriever)

evaluator = dspy.Evaluate(
    devset=CODEBASE_QA_DATASET,
    metric=answer_quality_metric,
    num_threads=1,
    display_progress=True,
)

baseline_score = evaluator(explorer)
print(f"Baseline score: {baseline_score}")

Now let's see what LabeledFewShot does:

from dspy.teleprompt import LabeledFewShot

optimizer = LabeledFewShot(k=3)
optimized_explorer = optimizer.compile(
    student=explorer,
    trainset=CODEBASE_QA_DATASET,
)

optimized_score = evaluator(optimized_explorer)
print(f"LabeledFewShot score: {optimized_score}")

That's it. Two lines to create the optimizer, one to compile. The optimized_explorer is the same CodebaseExplorer class, same forward() method, same retriever — but now each Predict inside it has 3 labeled demonstrations baked into its prompt. The LM sees examples of what good answers look like before it tries to answer your question.

🚨 Gotcha: The trainset passed to compile() must contain the output fields too, not just inputs. If your Examples only have question but not answer, LabeledFewShot has nothing to demonstrate. You'll get an optimizer that compiled successfully but changed nothing.


Project Setup: The Ticket Classifier

Now let's build something fresh. A Customer Support Ticket Classifier is the perfect optimization playground because:

  1. Classification is measurable. Either the ticket got the right label or it didn't. No subjective "is this answer good enough?"
  2. The baseline will be mediocre. Without examples, the LM will guess at your categories and get some wrong.
  3. Optimization will show clear gains. Adding demonstrations and tuning instructions moves the needle visibly.
Set up the project:

mkdir ch04_babel_fish && cd ch04_babel_fish
poetry init --name ch04-babel-fish --python ">=3.10,<3.15" -n
poetry add "dspy>=3.1.3,<4.0.0" python-dotenv

Your .env:

ANTHROPIC_API_KEY=your-anthropic-key-here
OPENAI_API_KEY=your-openai-key-here
