The Babel Fish is small, yellow, and leech-like... it feeds on brainwave energy. DSPy optimizers are similar, except they feed on your training data and make your programs dramatically smarter.
— Douglas Adams (adapted)
You've built programs. You've composed pipelines. You've wired up retrieval. Everything works, and if you squint, you might even call it "production-ready."
But here's the uncomfortable truth: every DSPy program you've written so far is running on vibes.
When you wrote dspy.ChainOfThought("context, question -> answer"), DSPy generated a reasonable prompt and sent it to Claude. Claude, being Claude, gave a reasonable answer. But "reasonable" isn't "optimized." The prompt DSPy generated wasn't tuned for your specific task. The examples it showed the LM (if any) weren't selected for maximum impact. The instructions weren't refined through experimentation.
You've been driving a Ferrari in first gear.
This chapter is where we shift up. DSPy's optimizers — officially called teleprompters, because naming things is hard — are the framework's superpower. They take your program, your data, and your metric, and they automatically figure out how to make the LM do a better job. No manual prompt tweaking. No "please try harder" appended to your system message. Just algorithms that search for better ways to instruct the model.
And the best part? Your code doesn't change. Same Signatures. Same Modules. Same forward() method. The optimizer adjusts the parameters — the instructions and demonstrations that DSPy injects into the prompt — while your program structure stays clean.
Before we touch any optimizer, you need to understand what they're actually optimizing.
Every dspy.Predict (or ChainOfThought, or any prediction module) inside your program has hidden parameters: an instruction (the natural-language task description at the top of the prompt) and a set of demonstrations (the input/output examples shown to the LM before your actual input).
When you write dspy.ChainOfThought("context, question -> answer"), the instruction is auto-generated from your field names and descriptions, and there are zero demonstrations. That's the "unoptimized" baseline.
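To make that parameter space concrete, here's a plain-Python sketch. This is not DSPy's actual internals — the class name and fields below are illustrative stand-ins for the shape of what each predictor carries:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for a predictor's hidden parameters.
# DSPy's real internals differ; the point is the shape of the search space.
@dataclass
class PredictorParams:
    instruction: str                            # natural-language task description
    demos: list = field(default_factory=list)   # input/output examples shown to the LM

# The unoptimized baseline: an auto-generated instruction, zero demonstrations.
baseline = PredictorParams(
    instruction="Given the fields `context`, `question`, produce the field `answer`.",
)
print(len(baseline.demos))  # 0
```

Everything an optimizer does amounts to proposing new values for these two fields on every predictor in your program.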
An optimizer's job is to find better values for these parameters. LabeledFewShot stuffs in examples from your training data. BootstrapFewShot generates synthetic examples by running your program and keeping the successes. MIPROv2 rewrites the instructions themselves using an LM. They're all searching the same space — they just use different strategies.
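The difference in strategy is easier to see in miniature. This is a deliberately simplified sketch — the function names and logic are mine, not DSPy's — of the core move each few-shot optimizer makes:

```python
import random

def labeled_fewshot(trainset, k=3, seed=0):
    # LabeledFewShot, in miniature: sample k labeled examples from the
    # trainset and use them verbatim as demonstrations.
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))

def bootstrap_fewshot(program, metric, trainset, max_demos=3):
    # BootstrapFewShot, in miniature: run the (unoptimized) program on
    # training inputs and keep only the runs the metric accepts.
    demos = []
    for example in trainset:
        prediction = program(example)
        if metric(example, prediction) and len(demos) < max_demos:
            demos.append((example, prediction))
    return demos
```

MIPROv2 goes a step further than both: instead of only choosing demonstrations, it uses an LM to propose and score candidate instructions, searching the instruction field as well.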
Think of it like this:
Your code defines the what. The optimizer finds the how.
Each optimizer we'll cover runs some version of the same bootstrap → evaluate → (improve) loop. Watch how the score changes as we apply them.
Every optimizer needs two things:
1. A training set — examples of inputs (and optionally expected outputs) that represent your task. In DSPy, this is a list of dspy.Example objects.
2. A metric function — a callable that takes (example, prediction, trace=None) and returns a score. This is how the optimizer knows whether its changes made things better or worse.
The quality of your metric determines the quality of your optimization. A vague metric gives vague improvements. A precise metric gives precise improvements. We'll get very specific about metric design in this chapter.
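Here's what that contract looks like in practice. Exact match is rarely the right metric for generation tasks, so treat this as a shape example rather than a recommendation — what matters is the (example, prediction, trace=None) signature:

```python
from types import SimpleNamespace

def exact_match_metric(example, pred, trace=None):
    # Score 1.0 when the predicted answer matches the gold answer,
    # ignoring case and surrounding whitespace; 0.0 otherwise.
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

# Smoke test with stand-in objects (anything with an .answer attribute works):
gold = SimpleNamespace(answer="Paris")
prediction = SimpleNamespace(answer="  paris ")
print(exact_match_metric(gold, prediction))  # 1.0
```

The trace argument matters during optimization: DSPy passes the program's execution trace so the metric can, if it wants, judge intermediate steps rather than only the final output.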
🧪 Lab Notes: In production, I spend more time on the metric function than on the module code. The module is usually straightforward — "retrieve context, generate answer." The metric is where the domain knowledge lives. What counts as a "good" answer for your specific use case? That question deserves serious thought.
Before we build something new, let's see what optimizers can do for a program we already understand. Remember our CodebaseExplorer from Chapter 3? Let's optimize it.
"""
optimize_explorer.py — Optimizing the Chapter 3 Codebase Explorer
"""
import os
import dspy
from dotenv import load_dotenv
load_dotenv()
# Reuse everything from Chapter 3
from codebase_qa import (
    load_codebase, chunk_code_files, get_embedder,
    CodebaseExplorer, answer_quality_metric, CODEBASE_QA_DATASET,
)
# Setup
lm = dspy.LM("anthropic/claude-sonnet-4-6", temperature=0.7, max_tokens=1500)
dspy.configure(lm=lm)
# Build the retriever (reuse from Ch3)
repo_path = os.path.join(os.path.dirname(__file__), "..", "ch03_retrieval",
                         "..", "..", "dspy", "dspy")
files = load_codebase(repo_path)
chunks = chunk_code_files(files)
embedder = get_embedder()
retriever = dspy.Embeddings(embedder=embedder, corpus=chunks, k=5)
# Unoptimized baseline
explorer = CodebaseExplorer(retriever=retriever)
evaluator = dspy.Evaluate(
    devset=CODEBASE_QA_DATASET,
    metric=answer_quality_metric,
    num_threads=1,
    display_progress=True,
)
baseline_score = evaluator(explorer)
print(f"Baseline score: {baseline_score}")

Now let's see what LabeledFewShot does:
from dspy.teleprompt import LabeledFewShot
optimizer = LabeledFewShot(k=3)
optimized_explorer = optimizer.compile(
    student=explorer,
    trainset=CODEBASE_QA_DATASET,
)
optimized_score = evaluator(optimized_explorer)
print(f"LabeledFewShot score: {optimized_score}")

That's it. Two lines to create the optimizer, one to compile. The optimized_explorer is the same CodebaseExplorer class, same forward() method, same retriever — but now each Predict inside it has 3 labeled demonstrations baked into its prompt. The LM sees examples of what good answers look like before it tries to answer your question.
🚨 Gotcha: The trainset passed to compile() must contain the output fields too, not just inputs. If your Examples only have question but not answer, LabeledFewShot has nothing to demonstrate. You'll get an optimizer that compiled successfully but changed nothing.
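A quick way to catch that failure mode before compiling. This validation helper is my own, not part of DSPy — it just encodes the rule above:

```python
def demonstrable(example_fields, input_keys):
    # An example can serve as a few-shot demonstration only if it carries
    # at least one field beyond its inputs -- i.e., something to show the LM.
    return any(key not in input_keys for key in example_fields)

complete = {"question": "What does Predict do?", "answer": "It makes one LM call."}
inputs_only = {"question": "What does Predict do?"}

print(demonstrable(complete, {"question"}))     # True
print(demonstrable(inputs_only, {"question"}))  # False
```

Running a check like this over your trainset before calling compile() turns a silent no-op into a loud failure.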
Now let's build something fresh. A Customer Support Ticket Classifier is the perfect optimization playground. Set up the project:
mkdir ch04_babel_fish && cd ch04_babel_fish
poetry init --name ch04-babel-fish --python ">=3.10,<3.15" -n
poetry add "dspy>=3.1.3,<4.0.0" python-dotenv

Your .env:
ANTHROPIC_API_KEY=your-anthropic-key-here
OPENAI_API_KEY=your-openai-key-here