Capstone: Eval Operating System
Build the final boss in 18-capstone-eval-operating-system. It combines dataset versioning, rubric calibration, model-judge checks, offline regressions, canary rollout gates, trace-backed debugging, and refresh cadence. Small ask. Basically Tuesday, if Tuesday had governance.
- Versioned golden and stress sets with slice-level reporting.
- Human rubric calibration paired with model-judge disagreement review.
- Deterministic CI checks for high-risk regressions.
- Canary rollout thresholds tied to offline and online evidence.
- Trace metadata for failed examples, judge drift, and refresh decisions.
- A Good / Bad / Ugly launch memo reviewers can audit later.
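
To make two of these pieces concrete, here is a minimal sketch of slice-level reporting feeding a canary gate. Everything named here is an assumption for illustration, not part of the capstone spec: the `EvalResult` fields, the `slice_pass_rates` and `canary_gate` helpers, and the default thresholds (`max_slice_drop=0.02`, `max_online_error_rate=0.01`) are all hypothetical, and a real gate would likely use more evidence than one online error rate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    # All field names are hypothetical, chosen for this sketch.
    example_id: str
    slice_name: str       # e.g. "billing" or "long_context"
    passed: bool
    dataset_version: str  # version tag of the golden or stress set


def slice_pass_rates(results):
    """Return {slice_name: pass_rate} for an iterable of EvalResult."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.slice_name] += 1
        passes[r.slice_name] += int(r.passed)
    return {name: passes[name] / totals[name] for name in totals}


def canary_gate(candidate_rates, baseline_rates, online_error_rate,
                max_slice_drop=0.02, max_online_error_rate=0.01):
    """Decide whether a canary rollout may proceed.

    Offline evidence: no baseline slice may be missing from the candidate
    run or drop by more than max_slice_drop.
    Online evidence: the canary error rate must not exceed
    max_online_error_rate.
    Returns (allowed, reasons) so the reasons can go into the launch memo.
    """
    reasons = []
    for name, base in sorted(baseline_rates.items()):
        current = candidate_rates.get(name)
        if current is None:
            reasons.append(f"slice '{name}' missing from candidate run")
        elif base - current > max_slice_drop:
            reasons.append(
                f"slice '{name}' pass rate dropped {base - current:.3f}"
            )
    if online_error_rate > max_online_error_rate:
        reasons.append(
            f"online error rate {online_error_rate:.3f} exceeds "
            f"{max_online_error_rate:.3f}"
        )
    return (not reasons, reasons)
```

Returning the list of reasons rather than a bare boolean is a deliberate choice in this sketch: the same strings can be attached to trace metadata or pasted into the Good / Bad / Ugly memo, so a reviewer can see later why a rollout was blocked.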