Harmless DSPy
Chapter 7

The Answer Is 42 (Tokens)

50 min read

Chapter code

"Forty-two!" yelled Loonquawl. "Is that all you've got to show for seven and a half million years' work?" "I checked it very thoroughly," said the computer, "and that quite definitely is the answer."

— Douglas Adams (adapted)

Deep Thought computed for 7.5 million years and produced the answer 42. Our LLMs compute for about 3 seconds and produce... well, sometimes the answer, sometimes a hallucination, and occasionally a sonnet about the question instead of answering it.

This chapter is about pushing DSPy to its limits and then going a little further. We're going to feed it images and ask it to see. We're going to capture its internal reasoning and inspect it like a flight recorder after a crash. We're going to build a system that uses cheap models for easy work and expensive models for hard problems — the model equivalent of "don't bring a bazooka to a knife fight." And we're going to survey the most advanced optimization techniques DSPy offers, from evolutionary prompt tuning to reinforcement learning.

Our project: a Multimodal Product Review Analyzer that takes product images alongside text reviews, analyzes both modalities independently, cross-references them, and produces a structured quality report. Along the way, we'll cover dspy.Image, dspy.Reasoning, model cascading, MultiChainComparison, dspy.Parallel, and the advanced optimizers that are too powerful for polite company.


Project Setup

mkdir ch07_advanced && cd ch07_advanced
poetry init --name ch07-advanced --python ">=3.10,<3.15" --no-interaction
# pyproject.toml
[tool.poetry]
name = "ch07-advanced"
version = "0.1.0"
description = "Chapter 7: The Answer Is 42 (Tokens)"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = ">=3.10,<3.15"
dspy = ">=3.1.3,<4.0.0"
python-dotenv = ">=1.2.2,<2.0.0"
Pillow = ">=11.0.0,<12.0.0"
requests = ">=2.32.0,<3.0.0"

[build-system]
requires = ["poetry-core>=2.0.0,<3.0.0"]
build-backend = "poetry.core.masonry.api"
poetry lock && poetry install && poetry shell

We add Pillow for image manipulation and requests for fetching remote images. The .env file is the same as previous chapters.


Multimodal Inputs: Teaching DSPy to See

DSPy treats images, audio, and files as first-class input types. The dspy.Image class handles all the encoding complexity — URLs, local files, PIL images, raw bytes — and normalizes everything into a format the vision-capable LLM can understand.

Creating Images

import dspy

# From a URL (passed directly to the vision model)
img = dspy.Image("https://example.com/product.jpg")

# From a URL with download (base64 encodes it)
img = dspy.Image("https://example.com/product.jpg", download=True)

# From a local file (auto base64 encoded)
img = dspy.Image("/path/to/product_photo.png")

# From a PIL image (programmatic creation)
from PIL import Image as PILImage
pil_img = PILImage.new("RGB", (200, 200), "blue")
img = dspy.Image(pil_img)

# From raw bytes
with open("photo.jpg", "rb") as f:
    img = dspy.Image(f.read())

Under the hood, dspy.Image converts everything into base64 data URIs (data:image/png;base64,...) that vision models consume natively. The encoding is cached (LRU, 32 entries) so repeated use of the same image doesn't re-encode.
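As a rough sketch of that normalization — hand-rolled here with the standard library; `dspy.Image`'s actual internals may differ — encoding raw bytes into a data URI looks like this:

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI, the shape vision APIs consume."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# A stand-in byte string (the PNG magic header); any real PNG/JPEG bytes work the same way.
uri = to_data_uri(b"\x89PNG\r\n\x1a\n")
print(uri[:30])  # data:image/png;base64,iVBORw0K
```

The point of the caching is exactly this step: base64-encoding a multi-megabyte photo on every call adds up, so the same input object reuses its encoded form.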

Using Images in Signatures

Use dspy.Image as a field type, just like str or int:

class AnalyzeProductImage(dspy.Signature):
    """Analyze a product image alongside its text review."""

    product_image: dspy.Image = dspy.InputField(
        desc="Photo of the product being reviewed"
    )
    review_text: str = dspy.InputField(
        desc="The text of the product review"
    )
    analysis: VisualAnalysis = dspy.OutputField(
        desc="Visual analysis of the product image"
    )

That's it. DSPy's adapter system handles the rest — formatting the image into the correct multimodal message format for whatever LLM you're using. Claude, GPT-5.4, Gemini — they all receive the image in their native format.
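The `VisualAnalysis` output type referenced in the signature isn't defined above. Here's a hypothetical sketch, assuming a Pydantic model (the usual choice for typed DSPy outputs); the field names are illustrative, not canonical:

```python
from pydantic import BaseModel, Field

class VisualAnalysis(BaseModel):
    """Illustrative structured output for the product-image analysis."""

    visible_condition: str = Field(
        description="What condition the product appears to be in"
    )
    visible_defects: list[str] = Field(
        default_factory=list, description="Any defects visible in the photo"
    )
    matches_review: bool = Field(
        description="Whether the photo is consistent with the review text"
    )
    confidence: float = Field(
        ge=0.0, le=1.0, description="Confidence in the visual assessment"
    )
```

With a model like this in scope, DSPy's adapters can parse the LLM's structured output into a `VisualAnalysis` instance, validation included.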

🚨 Gotcha: Not all LLMs support vision. If you pass a dspy.Image to a text-only model, it will fail. Claude Sonnet 4.6, Claude Haiku 4.5, and GPT-5.4 all support images. Check your model's documentation.
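One way to fail fast on that gotcha — a sketch where both the allowlist and the helper are our own invention, not part of DSPy, and the list must be maintained from your providers' docs:

```python
# Hypothetical allowlist of vision-capable models (seeded from the models
# named above; keep this in sync with provider documentation).
VISION_CAPABLE = {
    "anthropic/claude-sonnet-4-6",
    "anthropic/claude-haiku-4-5",
    "openai/gpt-5.4",
}

def assert_vision_support(model_name: str) -> None:
    """Raise a clear error up front instead of a cryptic provider error mid-call."""
    if model_name not in VISION_CAPABLE:
        raise ValueError(
            f"{model_name!r} is not in the vision allowlist; "
            "drop the image inputs or switch to a vision-capable model."
        )
```

Calling this before constructing the pipeline turns a confusing runtime failure into an immediate, readable one.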

Other Multimodal Types

DSPy also supports dspy.Audio (for speech models) and dspy.File (for document models):

# Audio
audio = dspy.Audio.from_file("recording.wav")
audio = dspy.Audio.from_url("https://example.com/audio.mp3")