Setting Up
import dspy
# Configure a language model
lm = dspy.LM(
    "anthropic/claude-sonnet-4-6",  # provider/model format (via LiteLLM)
    api_key="sk-...",               # or set ANTHROPIC_API_KEY env var
    temperature=0.7,                # 0.0 = deterministic, 1.0+ = creative
    max_tokens=2048,                # max output tokens
)
# Set as global default
dspy.configure(lm=lm)
# Per-request override (thread-safe)
with dspy.context(lm=other_lm):
    result = module(inputs)

Common model strings: anthropic/claude-sonnet-4-6, anthropic/claude-haiku-4-5-20251001, anthropic/claude-opus-4-6, openai/gpt-5.4, openai/gpt-5.4-mini
Defining What You Want: Signatures
# Inline signature (quick and simple)
predictor = dspy.Predict("question -> answer")
predictor = dspy.ChainOfThought("context, question -> answer, confidence")
# Class-based signature (production-grade, with types and descriptions)
class AnalyzeReview(dspy.Signature):
    """Analyze a product review and extract structured insights."""
    review_text: str = dspy.InputField(desc="The product review to analyze")
    category: str = dspy.InputField(desc="Product category for context")
    sentiment: Literal["positive", "negative", "neutral", "mixed"] = dspy.OutputField()
    analysis: MyPydanticModel = dspy.OutputField(desc="Structured analysis")

Field names carry meaning: a descriptive name like sentiment_label gives better results than a generic one like output. Docstrings become system instructions. Use Pydantic models for complex structured outputs.

Prediction Modules
| Module | What It Does | When to Use |
|---|---|---|
| dspy.Predict | Direct inference, no reasoning | Simple extraction, classification |
| dspy.ChainOfThought | Adds a rationale field for step-by-step reasoning | Most tasks — the default workhorse |
| dspy.MultiChainComparison | Generates M attempts, compares, picks best | Nuanced tasks where quality matters more than cost |
| dspy.Reasoning | Captures extended thinking (str-like output field) | With reasoning models (o1, o3); use as output field type |
# Predict — straightforward
result = dspy.Predict("question -> answer")(question="What is DSPy?")
# ChainOfThought — with reasoning
result = dspy.ChainOfThought("question -> answer")(question="What is DSPy?")
print(result.rationale) # The model's reasoning steps
# MultiChainComparison — ensemble M attempts
mcc = dspy.MultiChainComparison("question -> answer", M=3, temperature=0.7)
# Reasoning — extended thinking (as output field type)
class DeepAnalysis(dspy.Signature):
    question: str = dspy.InputField()
    reasoning: dspy.Reasoning = dspy.OutputField()  # str-like, captures thinking
    answer: str = dspy.OutputField()

Building Programs: Modules
class MyPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Declare sub-modules here — DSPy tracks them as parameters
        self.step1 = dspy.ChainOfThought(Step1Signature)
        self.step2 = dspy.Predict(Step2Signature)

    def forward(self, **inputs):
        # Your logic — regular Python; call sub-modules like functions
        result1 = self.step1(field=inputs["field"])
        result2 = self.step2(context=result1.output)
        return dspy.Prediction(final=result2.answer)

# Use it
pipeline = MyPipeline()
output = pipeline(field="some input")
print(output.final)

Composition patterns: Sequential (A → B → C), Branching (if/else on intermediate results), Parallel (dspy.Parallel), Fallback (try/except with dspy.context).
Agents and Tools
import requests

# Wrap any function as a tool
def search_web(query: str) -> str:
    """Search the web for information."""
    return requests.get(f"https://api.example.com/search?q={query}").text

tool = dspy.Tool(search_web)

# ReAct — reasoning + acting loop
agent = dspy.ReAct(
    "question -> answer",
    tools=[tool],
    max_iters=5,
)
result = agent(question="What's the latest on DSPy?")
# ProgramOfThought — generates code to solve problems
pot = dspy.ProgramOfThought("question -> answer")
# CodeAct — combines tools + code generation
code_agent = dspy.CodeAct("question -> answer", tools=[tool])

Multimodal Inputs
from PIL import Image as PILImage
# Images — from PIL, URL, file path, or bytes
img = dspy.Image(PILImage.new("RGB", (100, 100), "red")) # PIL
img = dspy.Image("https://example.com/photo.jpg") # URL
img = dspy.Image("/path/to/photo.png") # File
# Audio and documents
audio = dspy.Audio.from_file("interview.wav")
doc = dspy.File.from_path("report.pdf")
# Use in signatures
class AnalyzeImage(dspy.Signature):
    image: dspy.Image = dspy.InputField(desc="Product photo")
    caption: str = dspy.InputField()
    analysis: str = dspy.OutputField()

Retrieval (RAG)
# Set up an embedder
embedder = dspy.Embedder("openai/text-embedding-3-small", batch_size=200)
# Build a retrieval module
retriever = dspy.Embeddings(embedder=embedder, corpus=chunks, k=5)  # chunks: your list of text passages
# Use it
results = retriever("What does dspy.Module do?")
# Save and reload the index
retriever.save("my_index")
loaded_retriever = dspy.Embeddings.from_saved("my_index", embedder=embedder)

Optimizers
The magic of DSPy. These tune your program's prompts and demos automatically.
| Optimizer | What It Tunes | Best For | Cost |
|---|---|---|---|
| LabeledFewShot | Demo selection | When you have labeled examples and want a quick win | $ |
| BootstrapFewShot | Self-generated demos | Bootstrapping good examples from a teacher | $$ |
| BootstrapFewShotWithRandomSearch | Demos + hyperparams | More thorough than BootstrapFewShot | $$$ |
| MIPROv2 | Instructions + demos | Best general-purpose optimizer | $$–$$$$ |
| SIMBA | Instructions + demos | Memory-efficient alternative to MIPROv2 | $$$ |
| BootstrapFinetune | Model weights | When you want to distill into a smaller model | $$$$$ |
| GRPO | Model weights (RL) | When you have a reward function but no labels | $$$$$ |
| BetterTogether | Prompts + weights | Combined optimization (experimental) | $$$$$ |
| GEPA | Instructions (evolutionary) | Sophisticated prompt optimization (experimental) | $$$$ |
from dspy.teleprompt import BootstrapFewShot, MIPROv2

# Define a metric
def my_metric(example, prediction, trace=None):
    return prediction.sentiment == example.sentiment

# Optimize
optimizer = MIPROv2(metric=my_metric, auto="light")
optimized = optimizer.compile(
    MyPipeline(),
    trainset=train_examples,
)

# Save the optimized program
optimized.save("optimized_pipeline.json")

Escalation path: start with LabeledFewShot. If you need better → BootstrapFewShot. Still not enough → MIPROv2(auto="light"). For max quality → MIPROv2(auto="heavy"). For fine-tuning → BootstrapFinetune. For RL → GRPO. For experimental cutting-edge → GEPA or BetterTogether.

Evaluation
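Metrics don't have to be boolean: evaluation averages whatever your metric returns, so a float awards partial credit. A minimal sketch, using plain objects as stand-ins for dspy.Example and the module's Prediction (the field names and weights are illustrative; returning a hard pass/fail when trace is not None is a common DSPy convention for bootstrapping):

```python
from types import SimpleNamespace

def graded_metric(example, prediction, trace=None):
    """Partial credit: 0.6 for the right sentiment, 0.4 for a non-empty analysis."""
    score = 0.0
    if prediction.sentiment == example.sentiment:
        score += 0.6
    if getattr(prediction, "analysis", ""):
        score += 0.4
    # During bootstrapping (trace is not None), return a hard pass/fail
    if trace is not None:
        return score >= 0.6
    return score

gold = SimpleNamespace(sentiment="positive")
pred = SimpleNamespace(sentiment="positive", analysis="Great battery life.")
print(graded_metric(gold, pred))
```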
evaluator = dspy.Evaluate(
    devset=test_examples,
    metric=my_metric,
    num_threads=4,
    display_progress=True,
    display_table=5,  # show 5 example rows
)
score = evaluator(my_pipeline)
print(f"Accuracy: {score}%")

Production Patterns
Streaming
streaming = dspy.streamify(module)
for chunk in streaming(question="Tell me about DSPy"):
    if isinstance(chunk, dspy.Prediction):
        final_result = chunk  # Last item is the complete prediction
    else:
        print(chunk, end="")  # Intermediate text chunks

Async
async_module = dspy.asyncify(module)
result = await async_module(question="Tell me about DSPy")

Batch Processing
results = module.batch(
    examples,
    num_threads=8,
    return_failed_examples=True,
)

Cost Tracking
with dspy.track_usage() as tracker:
    result = module(question="Something")

print(f"Cost: ${tracker.total_cost:.4f}")
print(f"Tokens: {tracker.tokens}")

Caching
from dspy.clients import configure_cache
configure_cache(
    enable_disk_cache=True,
    enable_memory_cache=True,
    disk_cache_dir="~/.dspy_cache",
    disk_size_limit_bytes=2 * 1024**3,  # 2 GB
    memory_max_entries=10_000,
)

Callbacks (Observability)
import logging
from dspy.utils.callback import BaseCallback

log = logging.getLogger(__name__)

class MyLogger(BaseCallback):
    def on_module_end(self, call_id, outputs, exception):
        if exception:
            log.error(f"Module failed: {exception}")
        else:
            log.info(f"Module completed: {outputs}")

dspy.configure(lm=lm, callbacks=[MyLogger()])

Save / Load
# Save (must use .json or .pkl extension)
module.save("my_pipeline_v1.json")
# Load
module = MyPipeline()
module.load("my_pipeline_v1.json")

Per-Request Config (Thread-Safe)
# dspy.configure() is NOT thread-safe for concurrent calls
# Use dspy.context() for per-request overrides in web servers
with dspy.context(lm=request_specific_lm):
    result = module(inputs)

Adapters
# Global adapter
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())
# Per-request adapter
with dspy.context(adapter=dspy.XMLAdapter()):
    result = module(inputs)

# Available adapters:
# dspy.ChatAdapter() — default, uses [[ ## field ## ]] delimiters
# dspy.JSONAdapter() — forces JSON output
# dspy.XMLAdapter() — forces XML tags

Debugging
# See what DSPy actually sent to the LLM
dspy.inspect_history(n=3) # Last 3 LLM calls
# Check module parameters
for name, param in module.named_parameters():
    print(f"{name}: {type(param)}")

Common Import Patterns
# The essentials
import dspy
from pydantic import BaseModel, Field
from typing import Literal, Optional
# Optimizers
from dspy.teleprompt import (
    LabeledFewShot,
    BootstrapFewShot,
    BootstrapFewShotWithRandomSearch,
    MIPROv2,
    SIMBA,
    BootstrapFinetune,
    BetterTogether,
    GEPA,
)
from dspy.teleprompt.grpo import GRPO  # Not in __init__.py

# Callbacks and cache
from dspy.utils.callback import BaseCallback
from dspy.clients import configure_cache

# Tools for agents
# dspy.Tool(function) — wraps any callable

The Optimizer Decision Tree
The single most-asked question in the DSPy Discord: “Which optimizer should I use?” Start at the top and follow the branches.
Do you have labeled training examples?
- YES → How many?
  - Fewer than 50 → LabeledFewShot
  - 50 or more → BootstrapFewShot
  - Not good enough? → MIPROv2(auto="light")
    - Still not enough? → MIPROv2(auto="medium")
      - Still not enough? Want to tune model weights?
        - YES → BootstrapFinetune
        - NO → SIMBA or GEPA (experimental)
- NO → Can you write a reward/metric function?
  - YES → Do you have fine-tuning infrastructure?
    - YES → GRPO (RL)
    - NO → MIPROv2(auto="light")
      - Not good enough? → MIPROv2(auto="heavy")

Optimizer Cheat Sheet
Start here — solves 90% of cases:
LabeledFewShot
You have labeled examples and want a quick baseline. Takes seconds, costs almost nothing. Always try this first.
BootstrapFewShot
You have some labels but want DSPy to generate better demonstrations automatically. The teacher model runs your pipeline, keeps the traces that pass your metric. 5–10 minutes, low cost.
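Conceptually, the bootstrapping loop is a filter: run the teacher over your training inputs and keep only the outputs your metric accepts as demonstrations. A toy sketch of that idea, with plain functions standing in for the teacher pipeline and metric (BootstrapFewShot does this for you, plus full trace capture):

```python
def bootstrap_demos(trainset, teacher, metric, max_demos=4):
    """Keep (input, output) pairs whose teacher output passes the metric."""
    demos = []
    for example in trainset:
        prediction = teacher(example)
        if metric(example, prediction):
            demos.append((example, prediction))
        if len(demos) >= max_demos:
            break
    return demos

# Toy stand-ins: an uppercasing "teacher", a metric that checks the label
trainset = [{"q": "hi", "label": "HI"}, {"q": "bye", "label": "nope"}]
teacher = lambda ex: ex["q"].upper()
metric = lambda ex, pred: pred == ex["label"]
print(bootstrap_demos(trainset, teacher, metric))
```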
MIPROv2(auto="light")
The recommended general-purpose optimizer. Generates optimized instructions AND selects demos. Start with "light" (6 candidates), move to "medium" (12) or "heavy" (18) if needed.
When you need more:
SIMBA
Self-reflective mini-batch optimization. Good when MIPROv2's cost is too high, or when you have larger datasets.
BootstrapFinetune
When prompt optimization isn't enough and you want to fine-tune model weights. Requires a fine-tuning-capable model.
GRPO
Reinforcement learning for LMs. No labeled outputs needed — just a reward function. For advanced users with RL infrastructure.
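Because GRPO needs no gold labels, the reward function must score a prediction on its own merits. A hypothetical sketch (the checks and weights are purely illustrative, with a plain object standing in for the prediction):

```python
from types import SimpleNamespace

def reward_fn(example, prediction, trace=None):
    """Label-free reward: non-empty, concise, and cites a source."""
    answer = getattr(prediction, "answer", "") or ""
    reward = 0.0
    if answer:
        reward += 0.5  # produced something
    if 0 < len(answer.split()) <= 100:
        reward += 0.3  # stayed concise
    if "http" in answer:
        reward += 0.2  # cited a source
    return reward

pred = SimpleNamespace(answer="DSPy optimizes prompts. See https://dspy.ai")
print(reward_fn(None, pred))
```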
Experimental (but powerful):
BetterTogether
Alternates between prompt optimization and fine-tuning. Strategy string controls the sequence: "p -> w -> p" means optimize prompts, fine-tune weights, optimize prompts again.
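The strategy string is a tiny program over two step types. A toy interpreter makes the sequencing concrete (the stub steps merely record what a real prompt-optimization or fine-tuning pass would do):

```python
def run_strategy(strategy, program):
    """Apply 'p' (prompt) and 'w' (weight) steps in the order given."""
    steps = {
        "p": lambda prog: prog + ["prompts optimized"],
        "w": lambda prog: prog + ["weights fine-tuned"],
    }
    for token in strategy.split("->"):
        program = steps[token.strip()](program)
    return program

print(run_strategy("p -> w -> p", []))
```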
GEPA
Evolutionary prompt optimization with reflection. The most sophisticated prompt-only optimizer. Requires a reflection_lm parameter.
The Golden Rule
Start simple. Measure. Only escalate when the numbers say you need to. LabeledFewShot → BootstrapFewShot → MIPROv2(auto="light") solves 90% of real-world cases. The fancy optimizers exist for the other 10%.
Where to Go From Here
You've made it to the end. If you started at Chapter 1 and worked your way through, you've gone from “what is a Signature?” to building multimodal pipelines with model cascading, ensemble reasoning, and advanced optimizers. That's a significant journey.
Stay Current
The DSPy GitHub repository is the source of truth. Watch the releases. The official docs at dspy.ai are the best API reference. The DSPy Discord is where the community lives.
Build Something
The projects in this book are starting points, not endpoints. Extend a chapter project. Combine patterns across chapters — an agent (Ch 5) that uses RAG (Ch 3) with streaming and observability (Ch 6) composes naturally in DSPy.
The Bigger Picture
DSPy is a bet that LLM development shifts from prompting to programming. That bet is paying off. The answer might be 42, but the question — how do we build reliable, maintainable, optimizable AI systems? — is worth spending your career on.
“The ships hung in the sky in much the same way that bricks don't. Your DSPy programs, however, will soar.”
Don't Panic. And don't forget your towel.