Drift Monitoring And Refresh

The setup

An eval can be correct and stale. User behavior changes, documents update, model versions move, and judges drift. A frozen eval suite becomes a museum exhibit: historically interesting, operationally risky.

Picture this

[Figure: good, bad, and ugly paths for eval freshness management.]

Mental model

Watch three things: input distribution, failure distribution, and evaluator behavior. If any of them moves, the score may no longer mean what it used to mean.
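One way to put a number on "moved" is a Population Stability Index (PSI) over category labels. The sketch below is a minimal illustration, not a prescribed method: the function name `psi`, the smoothing constant `eps`, and the sample intent labels are all assumptions made for the example. The same function could be pointed at input intents, failure categories, or bucketed judge scores.

```python
import math
from collections import Counter


def psi(reference, current, eps=1e-6):
    """Population Stability Index between two lists of category labels.

    reference: labels from the baseline window (e.g. the eval set).
    current:   labels from the window being checked (e.g. sampled traffic).
    eps:       floor for category shares, so unseen categories do not
               cause a division by zero or log(0). Hypothetical default.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        p = max(ref_counts[category] / len(reference), eps)
        q = max(cur_counts[category] / len(current), eps)
        total += (q - p) * math.log(q / p)
    return total


# Illustrative labels only: a new "export" intent appears in recent traffic.
last_quarter = ["billing"] * 50 + ["login"] * 50
this_week = ["billing"] * 30 + ["login"] * 30 + ["export"] * 40
print(f"PSI: {psi(last_quarter, this_week):.2f}")
```

A PSI above roughly 0.2 is often treated as a meaningful shift, but that cutoff is a rule of thumb and should be calibrated against your own history.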

Good

The good version samples production traffic, compares it to eval coverage, tracks score distribution by model and judge version, and schedules rubric/data refresh. New incidents become regression examples after privacy review.
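Comparing production traffic to eval coverage can start as a simple share comparison per intent. The following is a sketch under stated assumptions: `coverage_gaps`, the `min_share` cutoff, and the "less than half the production share" rule are hypothetical choices for illustration, and the intent labels are presumed to come from whatever classifier or tagging you already run.

```python
from collections import Counter


def coverage_gaps(production_intents, eval_intents, min_share=0.02):
    """List intents that matter in production but are thin in the eval set.

    An intent is flagged when it makes up at least `min_share` of the
    production sample and its share of the eval set is less than half of
    its production share. Both thresholds are illustrative assumptions.
    Returns (intent, production_share, eval_share) tuples.
    """
    if not production_intents:
        return []
    prod = Counter(production_intents)
    evl = Counter(eval_intents)
    n_prod, n_eval = len(production_intents), len(eval_intents)
    gaps = []
    for intent, count in prod.most_common():
        prod_share = count / n_prod
        eval_share = evl[intent] / n_eval if n_eval else 0.0
        if prod_share >= min_share and eval_share < prod_share / 2:
            gaps.append((intent, round(prod_share, 3), round(eval_share, 3)))
    return gaps


# Illustrative data: "export" is common in traffic but absent from the eval set.
production = ["billing"] * 60 + ["export"] * 35 + ["misc"] * 5
golden_set = ["billing"] * 90 + ["misc"] * 10
for intent, prod_share, eval_share in coverage_gaps(production, golden_set):
    print(intent, prod_share, eval_share)
```

Each flagged intent is a candidate for new eval examples, subject to the same privacy review as incident-derived regressions.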

Bad

The bad version treats last quarter's golden set as sacred. It ignores new intents because changing the dataset would ruin trend lines. Very pure. Very useless.

Ugly

The ugly reality is refresh debt. Updating labels is expensive and politically annoying because scores may drop. That drop is information, not betrayal.

Artifact to produce

Maintain a refresh log: trigger, changed slices, examples added/removed, rubric changes, expected score discontinuity, and reviewer sign-off.
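The log can be as simple as one structured record per refresh. Here is a minimal sketch of such a record, assuming Python 3.9+ and a JSON Lines file as storage. The field names mirror the list above; the class name `RefreshLogEntry`, the `logged_on` date field, and the sample values are assumptions added for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class RefreshLogEntry:
    trigger: str                       # what prompted the refresh
    changed_slices: list[str]          # eval slices that were touched
    examples_added: int
    examples_removed: int
    rubric_changes: str                # free-text summary, or "none"
    expected_score_discontinuity: str  # what break in the trend line to expect
    reviewer_sign_off: str             # who approved the refresh
    # Assumed extra field: the date the entry was written.
    logged_on: str = field(default_factory=lambda: date.today().isoformat())


def to_jsonl(entry: RefreshLogEntry) -> str:
    """Serialize one entry as a single JSON line for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)


# Hypothetical values, for illustration only.
entry = RefreshLogEntry(
    trigger="new 'export' intent reached 35% of sampled traffic",
    changed_slices=["export"],
    examples_added=12,
    examples_removed=0,
    rubric_changes="none",
    expected_score_discontinuity="overall score likely to drop; new slice is harder",
    reviewer_sign_off="reviewer-name-here",
)
print(to_jsonl(entry))
```

An append-only file keeps the history auditable: when a score moves, the log shows whether the model changed or the eval did.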

Drift review

Question	Why it matters
Which distribution is monitored?	Input, failure, and judge drift are different.
What triggers dataset refresh?	Refresh needs a rule, not a mood.
How are score discontinuities documented?	A refresh can legitimately shift the baseline; a documented break keeps old and new scores from being compared as if nothing changed.
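"A rule, not a mood" can be made literal by writing the refresh triggers down as code. The sketch below is one possible shape, not a recommended policy: the function `refresh_triggers`, its four signals, and every default threshold are hypothetical and would need to be set from your own data.

```python
def refresh_triggers(
    input_drift,          # e.g. a drift statistic on input intents
    uncovered_share,      # share of sampled traffic with no matching eval slice
    judge_agreement,      # agreement rate between judge and human spot checks
    days_since_refresh,
    max_drift=0.2,        # all thresholds below are illustrative assumptions
    max_uncovered=0.05,
    min_agreement=0.8,
    max_age_days=90,
):
    """Return the list of reasons a dataset refresh is due (empty if none)."""
    reasons = []
    if input_drift > max_drift:
        reasons.append("input drift above threshold")
    if uncovered_share > max_uncovered:
        reasons.append("too much traffic outside eval coverage")
    if judge_agreement < min_agreement:
        reasons.append("judge agreement with humans below threshold")
    if days_since_refresh > max_age_days:
        reasons.append("scheduled refresh overdue")
    return reasons


# Illustrative check: inputs have drifted and the judge has slipped.
print(refresh_triggers(input_drift=0.35, uncovered_share=0.02,
                       judge_agreement=0.74, days_since_refresh=40))
```

Whatever fires here can be copied straight into the trigger field of the refresh log, so the reason for each refresh is recorded in the same terms that caused it.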

Chapter takeaway

A stable score can mean stable quality. It can also mean your eval stopped seeing reality. Annoying distinction, important distinction.

References

Evidently AI drift monitoring
