The setup
Most eval failures are not exotic. Teams reuse leaked examples, optimize one aggregate metric, trust model judges without calibration, block CI with flaky tests, or move thresholds after seeing results. The classics endure because they are convenient.
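The first failure on that list, reusing leaked examples, is also one of the cheapest to screen for. The sketch below is a minimal example, assuming eval and training examples are available as plain strings; `normalize`, `fingerprint`, and `find_leaked` are hypothetical helper names, not part of any library. It only catches near-exact duplicates (case and whitespace differences), so a clean result does not rule out paraphrased leakage.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a duplicate."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaked(eval_examples: list[str], train_examples: list[str]) -> list[str]:
    """Return eval examples whose normalized form also appears in the training data."""
    train_prints = {fingerprint(t) for t in train_examples}
    return [e for e in eval_examples if fingerprint(e) in train_prints]


# Illustrative data only.
train = ["What is the capital of France?", "Summarize this ticket."]
evals = ["what is the  capital of France?", "Translate this sentence."]
leaked = find_leaked(evals, train)
```

Here `leaked` contains the first eval example, because it differs from a training example only in case and spacing.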
Picture this
Mental model
For every eval artifact, ask: what decision does this inform, what failure can it miss, how can it be gamed, and what would make the score non-comparable next month?
Good
The good version has an anti-pattern review checklist. Before launch, reviewers look for leakage, threshold drift, missing slices, judge bias, unowned failures, and stale datasets.
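One way to keep that checklist from becoming a formality is to treat an unreviewed item the same as a failed one. The sketch below assumes the six checks named above; the `LaunchReview` class and its method names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

# The six checks named in the text, in the order reviewers would walk them.
CHECKS = (
    "leakage",
    "threshold drift",
    "missing slices",
    "judge bias",
    "unowned failures",
    "stale datasets",
)


@dataclass
class LaunchReview:
    eval_name: str
    results: dict[str, bool] = field(default_factory=dict)

    def record(self, check: str, passed: bool) -> None:
        """Store the outcome of one check; reject names outside the checklist."""
        if check not in CHECKS:
            raise ValueError(f"unknown check: {check!r}")
        self.results[check] = passed

    def blockers(self) -> list[str]:
        """Checks that failed or were never reviewed."""
        return [c for c in CHECKS if not self.results.get(c, False)]

    def ready(self) -> bool:
        return not self.blockers()


# Illustrative usage: four checks pass, one fails, one is skipped.
review = LaunchReview("example-eval")
for check in CHECKS[:4]:
    review.record(check, True)
review.record("unowned failures", False)
```

In this example `review.blockers()` lists both the failed check and the one nobody looked at, so `review.ready()` is false until every item has an explicit pass.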
Bad
The bad version says "we have evals" as if the noun itself protects users. A broken eval is not a guardrail; it is a decorative traffic cone.
Ugly
The ugly reality is incentives. Bad evals often survive because they make launches easier. Fixing them may lower scores, slow releases, and create uncomfortable conversations. That is the work.
Artifact to produce
Maintain an eval anti-pattern register with one entry per anti-pattern. Each entry records the smell, the likely hidden failure, the detection method, and the replacement pattern.
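A register like this can live in a spreadsheet, but keeping it in code makes it easy to review in the same pull request as the eval itself. The sketch below is one possible shape, assuming the four fields named above. The `AntiPattern` class, the `to_markdown` helper, and the two entries are illustrative, not a canonical list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AntiPattern:
    smell: str            # what a reviewer notices first
    hidden_failure: str   # what the eval is likely missing
    detection: str        # how to confirm the smell is real
    replacement: str      # what to do instead


# Illustrative entries only, drawn from the failure modes named in this chapter.
REGISTER = [
    AntiPattern(
        smell="Threshold moved after seeing results",
        hidden_failure="Regressions ship because the bar follows the score",
        detection="Compare the threshold in use with the value committed before the run",
        replacement="Commit thresholds before running and change them only through review",
    ),
    AntiPattern(
        smell="Model judge used without calibration",
        hidden_failure="Judge bias is reported as model quality",
        detection="Score a human-labeled sample with the judge and measure agreement",
        replacement="Calibrate the judge against human labels on a recurring schedule",
    ),
]


def to_markdown(entries: list[AntiPattern]) -> str:
    """Render the register as a markdown table for review documents."""
    header = "| Smell | Likely hidden failure | Detection method | Replacement pattern |"
    rule = "|---|---|---|---|"
    rows = [
        f"| {e.smell} | {e.hidden_failure} | {e.detection} | {e.replacement} |"
        for e in entries
    ]
    return "\n".join([header, rule, *rows])


table = to_markdown(REGISTER)
```

Freezing the dataclass is a deliberate choice here: entries are meant to be replaced through review, not mutated in place.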
Anti-pattern review
| Question | Why it matters |
|---|---|
| Which incentive does this eval create? | An eval that rewards the wrong behavior makes a bad system look good. |
| How could this score be gamed? | A score that can rise without the system improving is measuring the wrong thing. |
| What hidden failure would still pass? | Anti-patterns survive in the blind spots an eval never checks. |
Chapter takeaway
Every eval has a smell test. If the smell is "launch justification," open a window and review the design.