Anti-Patterns And Failure Modes

The setup

Most eval failures are not exotic. Teams reuse leaked examples, optimize one aggregate metric, trust model judges without calibration, block CI with flaky tests, or move thresholds after seeing results. The classics endure because they are convenient.
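As one illustration of catching the first item on that list, here is a minimal sketch of a leakage check. It assumes training and eval examples are available as plain strings, and the names `fingerprint` and `leaked_examples` are hypothetical. It only catches verbatim reuse after case and whitespace normalization; paraphrased leaks would need fuzzier matching.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-normalized copy of the text."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_examples(train_texts, eval_texts):
    """Return eval examples whose normalized text also appears in training data."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [e for e in eval_texts if fingerprint(e) in train_prints]


# Illustrative data only, not taken from any real dataset.
train = ["What is the capital of France?", "Summarize this memo."]
evals = ["what is the  capital of france?", "Translate 'hello' to Spanish."]

leaks = leaked_examples(train, evals)
print(leaks)
```

An empty result does not prove the eval set is clean; it only rules out the cheapest form of reuse.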

Picture this

Figure: good, bad, and ugly paths for eval failure-mode recognition.

Mental model

For every eval artifact, ask: what decision does this inform, what failure can it miss, how can it be gamed, and what would make the score non-comparable next month?

Good

The good version has an anti-pattern review checklist. Before launch, reviewers look for leakage, threshold drift, missing slices, judge bias, unowned failures, and stale datasets.
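The checklist can be made mechanical. The sketch below is one possible shape, assuming a review is recorded as a dictionary from item name to a boolean; `REVIEW_ITEMS` and `unresolved_items` are hypothetical names. An item that was never reviewed is treated the same as one that failed.

```python
# Checklist items taken from the paragraph above; the identifiers are illustrative.
REVIEW_ITEMS = (
    "leakage",
    "threshold_drift",
    "missing_slices",
    "judge_bias",
    "unowned_failures",
    "stale_datasets",
)


def unresolved_items(review: dict) -> list:
    """Return checklist items that were skipped or flagged.

    `review` maps an item name to True only when a reviewer checked it
    and found no problem. Missing keys count as unresolved.
    """
    return [item for item in REVIEW_ITEMS if review.get(item) is not True]


review = {
    "leakage": True,
    "threshold_drift": True,
    "missing_slices": False,  # reviewer found a gap
    "judge_bias": True,
    # "unowned_failures" and "stale_datasets" were never reviewed
}

blockers = unresolved_items(review)
print("blockers:", blockers)
```

Treating "not reviewed" as a blocker is a deliberate choice here: it keeps a skipped check from reading as a pass.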

Bad

The bad version says "we have evals" as if the noun itself protects users. A broken eval is not a guardrail; it is a decorative traffic cone.

Ugly

The ugly reality is incentives. Bad evals often survive because they make launches easier. Fixing them may lower scores, slow releases, and create uncomfortable conversations. That is the work.

Artifact to produce

Maintain an eval anti-pattern register: smell, likely hidden failure, detection method, and replacement pattern.
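A register like this can live in a spreadsheet, but a small typed structure makes gaps easy to spot. The following is a sketch under assumptions: the class name, field names, and the two sample entries are illustrative, not a prescribed schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AntiPatternEntry:
    smell: str           # what a reviewer notices first
    hidden_failure: str  # what the eval is likely missing
    detection: str       # how to confirm the problem
    replacement: str     # the pattern to adopt instead


# Illustrative entries based on the failure modes named in this chapter.
REGISTER = [
    AntiPatternEntry(
        smell="Threshold changed after results were seen",
        hidden_failure="Regressions pass because the bar moved",
        detection="Compare current thresholds with those recorded before the run",
        replacement="Record thresholds before running the eval",
    ),
    AntiPatternEntry(
        smell="One aggregate metric decides the launch",
        hidden_failure="A failing slice is averaged away",
        detection="Report the metric per slice alongside the aggregate",
        replacement="Gate on per-slice results as well as the aggregate",
    ),
]


def incomplete_entries(register):
    """Return entries where any of the four fields is blank."""
    return [e for e in register if not all(v.strip() for v in asdict(e).values())]


print(len(incomplete_entries(REGISTER)), "incomplete entries")
```

An entry with a smell but no detection method is a complaint rather than a register item, which is why the helper flags blank fields.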

Anti-pattern review

Question	Why it matters
Which incentive does this eval create?	Bad incentives make bad systems look good.
How could this score be gamed?	A score that can rise without the system improving points to a weak design.
What hidden failure would still pass?	Anti-patterns survive in blind spots.

Chapter takeaway

Every eval has a smell test. If the smell is "launch justification," open a window and review the design.

References

Google Rules of ML
