"Forty-two!" yelled Loonquawl. "Is that all you've got to show for seven and a half million years' work?" "I checked it very thoroughly," said the computer, "and that quite definitely is the answer."
— Douglas Adams (adapted)
Deep Thought computed for 7.5 million years and produced the answer 42. Our LLMs compute for about 3 seconds and produce... well, sometimes the answer, sometimes a hallucination, and occasionally a sonnet about the question instead of answering it.
This chapter is about pushing DSPy to its limits and then going a little further. We're going to feed it images and ask it to see. We're going to capture its internal reasoning and inspect it like a black box flight recorder. We're going to build a system that uses cheap models for easy work and expensive models for hard problems — the model equivalent of "don't bring a bazooka to a knife fight." And we're going to survey the most advanced optimization techniques DSPy offers, from evolutionary prompt tuning to reinforcement learning.
Our project: a Multimodal Product Review Analyzer that takes product images alongside text reviews, analyzes both modalities independently, cross-references them, and produces a structured quality report. Along the way, we'll cover dspy.Image, dspy.Reasoning, model cascading, MultiChainComparison, dspy.Parallel, and the advanced optimizers that are too powerful for polite company.
mkdir ch07_advanced && cd ch07_advanced
poetry init --name ch07-advanced --python ">=3.10,<3.15" --no-interaction
# pyproject.toml
[tool.poetry]
name = "ch07-advanced"
version = "0.1.0"
description = "Chapter 7: The Answer Is 42 (Tokens)"
authors = ["Your Name <you@example.com>"]
[tool.poetry.dependencies]
python = ">=3.10,<3.15"
dspy = ">=3.1.3,<4.0.0"
python-dotenv = ">=1.2.2,<2.0.0"
Pillow = ">=11.0.0,<12.0.0"
requests = ">=2.32.0,<3.0.0"
[build-system]
requires = ["poetry-core>=2.0.0,<3.0.0"]
build-backend = "poetry.core.masonry.api"
poetry lock && poetry install && poetry shell
We add Pillow for image manipulation and requests for fetching remote images. The .env file is the same as in previous chapters.
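As a reminder, the .env file only needs the API key for whichever provider you configured earlier. A minimal example (the key names below are the standard ones for each provider; the values are placeholders):

```
# .env — substitute your real keys; only one provider is required
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
```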
DSPy treats images, audio, and files as first-class input types. The dspy.Image class handles all the encoding complexity — URLs, local files, PIL images, raw bytes — and normalizes everything into a format the vision-capable LLM can understand.
import dspy
# From a URL (passed directly to the vision model)
img = dspy.Image("https://example.com/product.jpg")
# From a URL with download (base64 encodes it)
img = dspy.Image("https://example.com/product.jpg", download=True)
# From a local file (auto base64 encoded)
img = dspy.Image("/path/to/product_photo.png")
# From a PIL image (programmatic creation)
from PIL import Image as PILImage
pil_img = PILImage.new("RGB", (200, 200), "blue")
img = dspy.Image(pil_img)
# From raw bytes
with open("photo.jpg", "rb") as f:
    img = dspy.Image(f.read())
Under the hood, dspy.Image converts everything into base64 data URIs (data:image/png;base64,...) that vision models consume natively. The encoding is cached (LRU, 32 entries) so repeated use of the same image doesn't re-encode.
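To demystify that, here is a stdlib-only sketch of what such an encoding step looks like. Note that `to_data_uri` is a hypothetical helper for illustration, not DSPy's actual internal function — but the data URI format and the LRU caching pattern match what the library is described as doing:

```python
import base64
import mimetypes
from functools import lru_cache

@lru_cache(maxsize=32)  # mirror the 32-entry LRU cache described above
def to_data_uri(path: str) -> str:
    """Read an image file and wrap its bytes in a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)  # e.g. "image/png", guessed from extension
    mime = mime or "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Because of the `lru_cache` decorator, calling `to_data_uri` twice with the same path returns the cached string without touching the filesystem again — the same reason DSPy's own caching makes repeated use of one image cheap.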
Use dspy.Image as a field type, just like str or int:
class AnalyzeProductImage(dspy.Signature):
    """Analyze a product image alongside its text review."""

    product_image: dspy.Image = dspy.InputField(
        desc="Photo of the product being reviewed"
    )
    review_text: str = dspy.InputField(
        desc="The text of the product review"
    )
    analysis: VisualAnalysis = dspy.OutputField(
        desc="Visual analysis of the product image"
    )
That's it. DSPy's adapter system handles the rest — formatting the image into the correct multimodal message format for whatever LLM you're using. Claude, GPT-5.4, Gemini — they all receive the image in their native format.
🚨 Gotcha: Not all LLMs support vision. If you pass a dspy.Image to a text-only model, it will fail. Claude Sonnet 4.6, Claude Haiku 4.5, and GPT-5.4 all support images. Check your model's documentation.
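One way to fail fast on this is a small guard that checks the model name before you build an image-taking module. This is a hypothetical helper, not a DSPy API, and the allowlist below is an assumption based only on the models named above — always verify against your provider's documentation:

```python
# Illustrative guard — the prefix list is an assumption, not exhaustive.
VISION_CAPABLE_PREFIXES = (
    "claude-sonnet-4",   # e.g. Claude Sonnet 4.6
    "claude-haiku-4",    # e.g. Claude Haiku 4.5
    "gpt-5",             # e.g. GPT-5.4
)

def supports_vision(model_name: str) -> bool:
    """Best-effort check that a model can accept dspy.Image inputs."""
    name = model_name.split("/")[-1]  # strip a provider prefix like "anthropic/"
    return name.startswith(VISION_CAPABLE_PREFIXES)
```

Calling `supports_vision(lm_name)` before wiring up a multimodal signature turns a confusing mid-pipeline provider error into an immediate, readable one.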
DSPy also supports dspy.Audio (for speech models) and dspy.File (for document models):
# Audio
audio = dspy.Audio.from_file("recording.wav")
audio = dspy.Audio.from_url("https://example.com/audio.mp3")