The setup
The capstone is an eval operating system: not one script, not one dashboard, and definitely not one heroic notebook. It is the loop that turns product risk into examples, examples into scores, scores into decisions, and production failures back into better evals.
Mental model
Treat evals as product infrastructure. The lifecycle has six stages: design-time work, development-time regression testing, pre-launch gating, rollout monitoring, incident learning, and scheduled refresh. Each stage needs its own artifacts and a named owner.
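One way to make "each stage needs artifacts and owners" checkable is a small audit over a stage plan. The sketch below is illustrative only: the six stage names come from the paragraph above, while `StagePlan`, `audit`, and the example owners and artifact assignments are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# Stage names follow the lifecycle listed in the text; identifiers are illustrative.
STAGES = [
    "design",
    "development_regression",
    "pre_launch_gate",
    "rollout_monitoring",
    "incident_learning",
    "scheduled_refresh",
]


@dataclass
class StagePlan:
    owner: str = ""
    artifacts: list[str] = field(default_factory=list)


def audit(plan: dict[str, StagePlan]) -> list[str]:
    """Return one message per gap: a missing stage, owner, or artifact list."""
    gaps = []
    for stage in STAGES:
        entry = plan.get(stage)
        if entry is None:
            gaps.append(f"{stage}: not planned")
            continue
        if not entry.owner:
            gaps.append(f"{stage}: no owner")
        if not entry.artifacts:
            gaps.append(f"{stage}: no artifacts")
    return gaps


# Hypothetical, deliberately incomplete plan.
example_plan = {
    "design": StagePlan(owner="pm", artifacts=["eval card"]),
    "development_regression": StagePlan(owner="eng", artifacts=["CI policy"]),
    "pre_launch_gate": StagePlan(owner="", artifacts=["rollout plan"]),
}

for gap in audit(example_plan):
    print(gap)
```

An empty result from `audit` means every stage has at least one artifact and an owner on record. It says nothing about whether those artifacts are any good, which is a separate review.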
Good
The good version has a versioned dataset, calibrated rubric, validated model judge, CI tiering, rollout gates, trace-backed online monitoring, and a refresh cadence. It can explain why a launch expands, holds, or rolls back.
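A launch that can "explain why" it expands, holds, or rolls back usually has a gate that returns a reason alongside the decision. The following is a minimal sketch under stated assumptions: the three thresholds, the `GateInputs` fields, and the ordering of the checks are all hypothetical, and real values would live in the rollout plan.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    offline_pass_rate: float    # share of gating eval cases passed, 0..1
    online_failure_rate: float  # share of sampled production traces flagged, 0..1
    judge_validated: bool       # model judge checked against human labels?


# Hypothetical thresholds for illustration only.
MIN_OFFLINE_PASS = 0.95
MAX_ONLINE_FAILURE = 0.02
ROLLBACK_ONLINE_FAILURE = 0.05


def decide(inputs: GateInputs) -> tuple[str, str]:
    """Return (decision, reason); decision is 'expand', 'hold', or 'rollback'."""
    if inputs.online_failure_rate >= ROLLBACK_ONLINE_FAILURE:
        return "rollback", (
            f"online failure rate {inputs.online_failure_rate:.1%} "
            f"is at or above the rollback limit {ROLLBACK_ONLINE_FAILURE:.1%}"
        )
    if not inputs.judge_validated:
        return "hold", "judge scores are not validated against human labels"
    if inputs.offline_pass_rate < MIN_OFFLINE_PASS:
        return "hold", (
            f"offline pass rate {inputs.offline_pass_rate:.1%} "
            f"is below the minimum {MIN_OFFLINE_PASS:.1%}"
        )
    if inputs.online_failure_rate > MAX_ONLINE_FAILURE:
        return "hold", (
            f"online failure rate {inputs.online_failure_rate:.1%} "
            f"is above the expansion limit {MAX_ONLINE_FAILURE:.1%}"
        )
    return "expand", "all gates met"


decision, reason = decide(GateInputs(0.97, 0.01, True))
print(decision, "-", reason)
```

Rollback is checked first in this sketch so that a serious production signal is never masked by a healthy offline score. Storing the returned reason in the decision record is one way to keep the explanation next to the outcome.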
Bad
The bad version has three disconnected spreadsheets and a notebook only one person can run. It passes when that person is online and Mercury is feeling generous.
Ugly
The ugly reality is maintenance. Eval systems rot unless ownership is explicit. The capstone forces you to name owners, cadences, and escalation paths because future-you deserves fewer mysteries.
Artifact to produce
Assemble the operating packet: eval card, dataset manifest, rubric packet, judge card, CI policy, rollout plan, trace policy, and decision record.
Operating-system review
| Question | Why it matters |
|---|---|
| Which artifacts exist for every lifecycle stage? | The loop needs more than launch-week energy. |
| Who owns refresh, incidents, and rollout gates? | Ownership keeps evals alive. |
| How does production failure improve the suite? | Incidents should become future coverage. |
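The last row of the table, turning incidents into future coverage, can be sketched as a function that converts a failing production trace into a regression case and bumps the dataset version. Everything here is an assumption made for illustration: the `EvalCase` fields, the manifest shape (`version` plus a `cases` list), and the content-hash id scheme are not prescribed by the text.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected_behavior: str
    source: str  # where the case came from, for traceability


def case_from_incident(trace_id: str, input_text: str, expected_behavior: str) -> EvalCase:
    """Build a regression case from a production trace (hypothetical schema)."""
    # Hash the input so the same failing input always maps to the same id.
    digest = hashlib.sha256(input_text.encode("utf-8")).hexdigest()[:12]
    return EvalCase(
        case_id=f"incident-{digest}",
        input_text=input_text,
        expected_behavior=expected_behavior,
        source=f"trace:{trace_id}",
    )


def add_case(manifest: dict, case: EvalCase) -> dict:
    """Return a new manifest with the case appended and the version bumped.

    If a case with the same id already exists, the manifest is returned unchanged.
    """
    if any(existing["case_id"] == case.case_id for existing in manifest["cases"]):
        return manifest
    return {
        "version": manifest["version"] + 1,
        "cases": manifest["cases"] + [asdict(case)],
    }


manifest = {"version": 3, "cases": []}
case = case_from_incident("t-001", "Where is my refund?", "Asks for the order ID")
updated = add_case(manifest, case)
print(updated["version"], len(updated["cases"]))
```

Returning a new manifest instead of mutating the old one keeps earlier dataset versions reproducible, so a score can always be tied to the exact version it was computed on.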
Chapter takeaway
The capstone is not a bigger eval. It is a maintenance system for evidence, which is less glamorous and much less likely to betray you.