The setup
An eval can be correct and stale. User behavior changes, documents update, model versions move, and judges drift. A frozen eval suite becomes a museum exhibit: historically interesting, operationally risky.
Picture this
Mental model
Watch three things: input distribution, failure distribution, and evaluator behavior. If any of them moves, the score may no longer mean what it used to mean.
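One way to make "moves" measurable is to compare category shares between a reference window and a current window. The sketch below uses total variation distance over categorical labels. The function names and the idea of labeling each request with an intent are assumptions for illustration, not something this chapter prescribes.

```python
from collections import Counter
from typing import Iterable


def category_shares(labels: Iterable[str]) -> dict[str, float]:
    """Turn a sequence of category labels into a share-per-category mapping."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(reference: Iterable[str], current: Iterable[str]) -> float:
    """Half the L1 distance between two categorical distributions.

    0.0 means identical shares, 1.0 means no overlap at all.
    """
    p = category_shares(reference)
    q = category_shares(current)
    # Categories missing from one side count as a share of 0.0 there.
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())


# Hypothetical intent labels: last quarter's eval set vs. this week's traffic sample.
eval_intents = ["billing"] * 8 + ["refund"] * 2
prod_intents = ["billing"] * 5 + ["refund"] * 2 + ["cancel"] * 3
drift = total_variation_distance(eval_intents, prod_intents)
```

The same function works for failure categories or judge verdicts; only the labels change. What counts as "too much" drift is a local decision.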
Good
The good version samples production traffic, compares it to eval coverage, tracks score distributions per model version and judge version, and schedules rubric and data refreshes. New incidents become regression examples after privacy review.
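A minimal sketch of two of those habits: finding intents that are common in production but thin in the eval set, and grouping scores by model and judge version. The record fields (`model_version`, `judge_version`, `score`) and the default thresholds are hypothetical placeholders.

```python
from collections import Counter, defaultdict
from statistics import mean


def coverage_gaps(production_intents, eval_intents, min_share=0.05, min_examples=5):
    """Return (intent, production_share, eval_count) for under-covered intents."""
    prod = Counter(production_intents)
    evals = Counter(eval_intents)
    total = sum(prod.values())
    gaps = []
    for intent, n in prod.most_common():
        share = n / total
        # Flag intents that matter in production but barely appear in the eval set.
        if share >= min_share and evals[intent] < min_examples:
            gaps.append((intent, round(share, 3), evals[intent]))
    return gaps


def mean_score_by_version(records):
    """Average score per (model_version, judge_version) pair."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["model_version"], r["judge_version"])].append(r["score"])
    return {key: mean(scores) for key, scores in groups.items()}


# Illustrative data only.
production = ["billing"] * 60 + ["refund"] * 30 + ["cancel"] * 10
eval_set = ["billing"] * 20 + ["refund"] * 3
gaps = coverage_gaps(production, eval_set)

records = [
    {"model_version": "m1", "judge_version": "j1", "score": 0.8},
    {"model_version": "m1", "judge_version": "j1", "score": 0.6},
    {"model_version": "m1", "judge_version": "j2", "score": 0.9},
]
by_version = mean_score_by_version(records)
```

Keying scores on both versions keeps a judge upgrade from being read as a model improvement.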
Bad
The bad version treats last quarter's golden set as sacred. It ignores new intents because changing the dataset would ruin trend lines. Very pure. Very useless.
Ugly
The ugly reality is refresh debt. Updating labels is expensive and politically annoying because scores may drop. That drop is information, not betrayal.
Artifact to produce
Maintain a refresh log: trigger, changed slices, examples added/removed, rubric changes, expected score discontinuity, and reviewer sign-off.
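The log does not need tooling beyond a structured record per refresh. Here is one possible shape, as a sketch: the field names mirror the list above, while the class name, the JSON-lines format, and the sample values are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RefreshLogEntry:
    refresh_date: str                   # ISO date of the refresh
    trigger: str                        # what prompted it
    changed_slices: list[str]           # eval slices that were touched
    examples_added: int
    examples_removed: int
    rubric_changes: list[str]
    expected_score_discontinuity: str   # plain-language note on the expected baseline shift
    reviewer: Optional[str] = None      # stays None until someone signs off

    def is_signed_off(self) -> bool:
        return bool(self.reviewer)

    def to_json_line(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log file."""
        return json.dumps(asdict(self))


# Illustrative entry; every value here is made up.
entry = RefreshLogEntry(
    refresh_date="2025-01-15",
    trigger="new cancellation intent seen in production sample",
    changed_slices=["cancel"],
    examples_added=40,
    examples_removed=5,
    rubric_changes=["added partial-credit rule for multi-step answers"],
    expected_score_discontinuity="overall score expected to drop; compare within slices",
)
line = entry.to_json_line()
```

An append-only file of such lines is enough to explain any step in a trend chart after the fact.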
Drift review
| Question | Why it matters |
|---|---|
| Which distribution is monitored? | Input, failure, and judge drift are different. |
| What triggers dataset refresh? | Refresh needs a rule, not a mood. |
| How are score discontinuities documented? | A refreshed eval can legitimately shift the baseline; a recorded break keeps the trend line honest. |
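"A rule, not a mood" can be as simple as a function that returns explicit reasons. The sketch below assumes the drift and agreement numbers are already computed elsewhere. The policy fields and every threshold are hypothetical and should be set from your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefreshPolicy:
    # All defaults are placeholder values, not recommendations.
    max_input_drift: float = 0.15       # e.g. a distance between eval and production shares
    min_judge_agreement: float = 0.90   # judge vs. human labels on a fixed anchor set
    max_days_since_refresh: int = 90


def refresh_reasons(input_drift, judge_agreement, days_since_refresh,
                    new_incidents, policy=RefreshPolicy()):
    """Return the list of reasons a refresh is due; an empty list means none."""
    reasons = []
    if input_drift > policy.max_input_drift:
        reasons.append(f"input drift {input_drift:.2f} exceeds {policy.max_input_drift:.2f}")
    if judge_agreement < policy.min_judge_agreement:
        reasons.append(f"judge agreement {judge_agreement:.2f} below {policy.min_judge_agreement:.2f}")
    if days_since_refresh > policy.max_days_since_refresh:
        reasons.append(f"{days_since_refresh} days since last refresh")
    if new_incidents > 0:
        reasons.append(f"{new_incidents} new incident(s) awaiting regression examples")
    return reasons


quiet = refresh_reasons(0.05, 0.95, 30, 0)
overdue = refresh_reasons(0.20, 0.85, 100, 2)
```

Returning reasons rather than a bare boolean gives the refresh log its trigger text for free.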
Chapter takeaway
A stable score can mean stable quality. It can also mean your eval stopped seeing reality. Annoying distinction, important distinction.