MCP Mastery
Chapter 12
boss
~40 min

Experiments, Canaries, and Rollouts

Connect eval thresholds to staged rollout decisions.

Eval Toolkit 2026.05
Observability trace-first
Python 3.11
Reviewed 2026-05-17

Reading this chapter helps prevent 5 common Eval Writing mistakes.

The setup

A rollout is an eval with users attached. Offline scores reduce uncertainty before exposure; online experiments reveal what offline suites missed. The bridge between them is a pre-registered rollout rule.

Picture this

[Figure: Good, bad, and ugly paths for staged rollout evaluation.]

Mental model

Define gates: offline pass, shadow traffic check, internal dogfood, small canary, larger canary, full launch. Each stage needs metrics, duration, stop conditions, and an owner who can actually stop it.

Good

The good version sets thresholds before the run, separates primary and guardrail metrics, and documents what happens on fail, hold, or expand. It respects sample size and does not declare victory after the first friendly hour.
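One way to make such a rule concrete is a small function that returns fail, hold, or expand. This is a minimal sketch under stated assumptions: `RolloutRule`, `decide`, the metric names, and the numbers are all hypothetical, and the choice to check guardrails before the sample-size check (so a breach stops the stage immediately) is one reasonable policy rather than the only one.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXPAND = "expand"
    HOLD = "hold"
    FAIL = "fail"


@dataclass(frozen=True)
class RolloutRule:
    """Thresholds fixed before the run. All names here are illustrative."""
    primary_metric: str
    primary_min: float               # primary metric must reach at least this
    guardrail_max: dict[str, float]  # each guardrail must stay at or below its limit
    min_samples: int                 # no expansion before this many observations


def decide(rule: RolloutRule, observed: dict[str, float], samples: int) -> Decision:
    # Assumption: a guardrail breach stops the stage even on a small sample.
    for name, limit in rule.guardrail_max.items():
        if observed[name] > limit:
            return Decision.FAIL
    # Not enough data yet: neither victory nor defeat on the primary metric.
    if samples < rule.min_samples:
        return Decision.HOLD
    if observed[rule.primary_metric] < rule.primary_min:
        return Decision.FAIL
    return Decision.EXPAND


# Placeholder thresholds for illustration only.
rule = RolloutRule(
    primary_metric="task_success_rate",
    primary_min=0.92,
    guardrail_max={"error_rate": 0.02},
    min_samples=5000,
)
```

With these placeholder numbers, a healthy-looking first hour of 800 samples yields `Decision.HOLD`, not a launch.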

Bad

The bad version watches dashboards live and launches when the line looks nice. If it dips, someone changes the window. This is not experimentation; it is astrology with SQL.
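A cheap defense against quietly changing the window is to fingerprint the rule when it is registered and compare at decision time. The sketch below uses only the standard library (`hashlib`, `json`); the rule fields, the values, and the helper names `rule_fingerprint` and `rule_unchanged` are assumptions made for illustration.

```python
import hashlib
import json


def rule_fingerprint(rule: dict) -> str:
    """Hash a canonical JSON form of the rule so any edit changes the digest."""
    canonical = json.dumps(rule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def rule_unchanged(rule_at_decision: dict, registered_hash: str) -> bool:
    """True only if the rule used for the decision matches what was registered."""
    return rule_fingerprint(rule_at_decision) == registered_hash


# Placeholder rule, recorded before the rollout starts.
registered = {"metric": "task_success_rate", "threshold": 0.92, "window_hours": 24}
registered_hash = rule_fingerprint(registered)
```

Storing `registered_hash` somewhere the team cannot casually edit (a ticket, a commit message) is what turns this from a checksum into a pre-registration record.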

Ugly

The ugly reality is pressure. Sales wants the feature, support fears tickets, and leadership wants certainty. A written rollout rule protects the team from improvising governance during panic.

Artifact to produce

Write a rollout plan: stage, exposure, primary metric, guardrails, minimum sample, pass/hold/rollback rule, and communication channel.
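The plan can live as plain data with a check that no field was left blank. The field list below follows the sentence above; the key names, the `validate_plan` helper, and the sample entry are hypothetical placeholders, offered as a sketch rather than a required schema.

```python
# Keys mirror the fields named in the text; the exact spellings are an assumption.
REQUIRED_FIELDS = (
    "stage",
    "exposure",
    "primary_metric",
    "guardrails",
    "minimum_sample",
    "decision_rule",          # the pass/hold/rollback rule
    "communication_channel",
)


def validate_plan(plan: list[dict]) -> list[str]:
    """Return one message per required field that is absent or empty."""
    errors = []
    for i, entry in enumerate(plan):
        for field in REQUIRED_FIELDS:
            if entry.get(field) in (None, "", [], {}):
                errors.append(f"entry {i}: missing {field}")
    return errors


# Illustrative entry with the communication channel left blank on purpose.
draft_plan = [
    {
        "stage": "small canary",
        "exposure": "1% of traffic",
        "primary_metric": "task_success_rate",
        "guardrails": ["error_rate"],
        "minimum_sample": 5000,
        "decision_rule": {
            "pass": "primary metric at or above threshold, guardrails within limits",
            "hold": "minimum sample not yet reached",
            "rollback": "any guardrail breached",
        },
        "communication_channel": "",
    }
]
```

Running `validate_plan(draft_plan)` flags the blank channel, which is exactly the kind of gap that surfaces at the worst possible moment otherwise.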

Rollout review

Question | Why it matters
What threshold was set before the rollout? | Pre-registration reduces metric shopping.
What sample or duration is required? | Early noise is not a launch oracle.
Who can pause or roll back? | A stop rule without authority is decorative.

Chapter takeaway

A canary is not a vibe check. It is a controlled exposure with a boringly explicit escape hatch.

Quiz

  1. What prevents rollout metrics from becoming dashboard astrology?

  2. Which is the bad version of staged rollout evaluation?

  3. What should the ugly reality change about your process?