Capstone Eval Operating System

The setup

The capstone is an eval operating system: not one script, not one dashboard, and definitely not one heroic notebook. It is the loop that turns product risk into examples, examples into scores, scores into decisions, and production failures back into better evals.

Picture this

Figure: good, bad, and ugly paths for the eval operating model.

Mental model

Treat evals as product infrastructure. There is design-time work, development-time regression testing, pre-launch gating, rollout monitoring, incident learning, and scheduled refresh. Each stage needs artifacts and owners.
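One way to make "artifacts and owners" checkable is to write the lifecycle down as data. The sketch below is only an illustration: the stage names follow the paragraph above, while the `Stage` class, the `find_gaps` helper, and every owner and artifact assignment are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stage:
    """One lifecycle stage of the eval loop (hypothetical structure)."""
    name: str
    artifacts: List[str] = field(default_factory=list)
    owner: Optional[str] = None


def find_gaps(stages: List[Stage]) -> List[str]:
    """Return one message per stage that lacks artifacts or an owner."""
    gaps = []
    for stage in stages:
        if not stage.artifacts:
            gaps.append(f"{stage.name}: no artifacts")
        if not stage.owner:
            gaps.append(f"{stage.name}: no owner")
    return gaps


# Illustrative values only; the owners and artifact assignments are assumptions.
LIFECYCLE = [
    Stage("design", ["eval card"], "product-lead"),
    Stage("development regression", ["dataset manifest", "CI policy"], "eng-lead"),
    Stage("pre-launch gating", ["rollout plan"], "release-owner"),
    Stage("rollout monitoring", ["trace policy"], None),
    Stage("incident learning", [], "on-call"),
    Stage("scheduled refresh", ["dataset manifest"], "eval-owner"),
]

gaps = find_gaps(LIFECYCLE)
for gap in gaps:
    print(gap)
```

With these placeholder values, the check flags rollout monitoring (no owner) and incident learning (no artifacts), which are the kinds of holes that usually stay invisible until something breaks.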

Good

The good version has a versioned dataset, calibrated rubric, validated model judge, CI tiering, rollout gates, trace-backed online monitoring, and a refresh cadence. It can explain why a launch expands, holds, or rolls back.
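To illustrate what "can explain why a launch expands, holds, or rolls back" might look like, here is a minimal sketch of a rollout gate that returns both a decision and a reason. The `GateInput` fields, the `decide` function, and all threshold values are assumptions chosen for the example, not recommended settings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Decision(Enum):
    EXPAND = "expand"
    HOLD = "hold"
    ROLL_BACK = "roll back"


@dataclass(frozen=True)
class GateInput:
    offline_pass_rate: float    # share of eval cases passing, 0..1
    online_failure_rate: float  # share of sampled traces flagged, 0..1
    dataset_version: str        # which dataset version produced the offline score


def decide(gate: GateInput,
           min_offline_pass: float = 0.95,      # assumed threshold
           hold_online_fail: float = 0.02,      # assumed threshold
           rollback_online_fail: float = 0.05,  # assumed threshold
           ) -> Tuple[Decision, str]:
    """Return a decision plus the reason, so the record explains itself."""
    if gate.online_failure_rate >= rollback_online_fail:
        return (Decision.ROLL_BACK,
                f"online failure rate {gate.online_failure_rate:.1%} "
                f"is at or above {rollback_online_fail:.1%}")
    if gate.offline_pass_rate < min_offline_pass:
        return (Decision.HOLD,
                f"offline pass rate {gate.offline_pass_rate:.1%} on dataset "
                f"{gate.dataset_version} is below {min_offline_pass:.1%}")
    if gate.online_failure_rate >= hold_online_fail:
        return (Decision.HOLD,
                f"online failure rate {gate.online_failure_rate:.1%} "
                f"is at or above {hold_online_fail:.1%}")
    return Decision.EXPAND, "offline and online checks are within thresholds"


decision, reason = decide(GateInput(0.97, 0.01, "v3"))
print(decision.value, "-", reason)
```

Returning the reason alongside the decision is the design choice that matters here: the same string can go straight into the decision record.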

Bad

The bad version has three disconnected spreadsheets and a notebook only one person can run. It passes when that person is online and Mercury is feeling generous.

Ugly

The ugly reality is maintenance. Eval systems rot unless ownership is explicit. The capstone forces you to name owners, cadences, and escalation paths because future-you deserves fewer mysteries.

Artifact to produce

Assemble the operating packet: eval card, dataset manifest, rubric packet, judge card, CI policy, rollout plan, trace policy, and decision record.
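A packet like this is easy to check mechanically. The sketch below assumes the packet is represented as a mapping from artifact name to a file location. The artifact names come from the list above, but the `missing_artifacts` helper and the example paths are hypothetical.

```python
from typing import Dict, List

# The eight artifacts named in the operating packet.
REQUIRED_ARTIFACTS = (
    "eval card",
    "dataset manifest",
    "rubric packet",
    "judge card",
    "CI policy",
    "rollout plan",
    "trace policy",
    "decision record",
)


def missing_artifacts(packet: Dict[str, str]) -> List[str]:
    """Return required artifacts that are absent or have a blank location."""
    return [name for name in REQUIRED_ARTIFACTS
            if not packet.get(name, "").strip()]


# Hypothetical packet; the paths are placeholders, not a required layout.
packet = {
    "eval card": "docs/eval_card.md",
    "dataset manifest": "data/manifest.json",
    "rubric packet": "docs/rubric.md",
    "judge card": "",
    "CI policy": "ci/policy.yaml",
    "rollout plan": "docs/rollout.md",
}

missing = missing_artifacts(packet)
print("missing:", ", ".join(missing) if missing else "none")
```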

Operating-system review

Question	Why it matters
Which artifacts exist for every lifecycle stage?	The loop needs more than launch-week energy.
Who owns refresh, incidents, and rollout gates?	Ownership keeps evals alive.
How does production failure improve the suite?	Incidents should become future coverage.
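The last question in the table, how a production failure improves the suite, can be sketched as a small conversion step from incident to regression case. The `Incident` fields, the case dictionary layout, and the `add_cases` helper are all assumptions for illustration; a real dataset format will differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Incident:
    trace_id: str       # identifier of the production trace (hypothetical field)
    user_input: str
    bad_output: str
    failure_mode: str   # e.g. "hallucinated-policy"


def incident_to_case(incident: Incident) -> Dict[str, object]:
    """Turn one incident into a regression case tagged with its failure mode."""
    return {
        "id": f"incident-{incident.trace_id}",
        "input": incident.user_input,
        "known_bad_output": incident.bad_output,
        "tags": ["regression", incident.failure_mode],
        "source": "production-incident",
    }


def add_cases(dataset: List[Dict[str, object]], incidents: List[Incident]) -> int:
    """Append cases for incidents not already in the dataset; return how many were added."""
    existing = {case["id"] for case in dataset}
    added = 0
    for incident in incidents:
        case = incident_to_case(incident)
        if case["id"] not in existing:
            dataset.append(case)
            existing.add(case["id"])
            added += 1
    return added


dataset: List[Dict[str, object]] = []
incidents = [
    Incident("t-101", "Can I get a refund?", "Refunds are never allowed.", "hallucinated-policy"),
    Incident("t-101", "Can I get a refund?", "Refunds are never allowed.", "hallucinated-policy"),
    Incident("t-202", "Summarize this ticket", "", "empty-response"),
]
added = add_cases(dataset, incidents)
print(f"added {added} regression cases")
```

Deduplicating on the trace identifier keeps a repeated incident report from inflating the dataset, and the added cases would normally ship as a new dataset version.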

Chapter takeaway

The capstone is not a bigger eval. It is a maintenance system for evidence, which is less glamorous and much less likely to betray you.

References

Braintrust evals guide