The setup
Eval reports are not trophies. They should say what was tested, what passed, what failed, what is unknown, and what decision follows. A useful report can be read three months later during an incident without requiring tribal memory.
Picture this
Mental model
Use a decision record: context, options, eval evidence, risks accepted, decision, owner, expiry or review date. Scores belong inside this story, not floating alone in a dashboard.
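As a minimal sketch, the fields above can be held in one structure so that scores always travel with their context. The class name, field names, and every example value below are hypothetical, chosen only to mirror the list in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DecisionRecord:
    """Hypothetical schema: one field per item in the decision-record list."""
    context: str
    options: list[str]
    eval_evidence: dict[str, float]  # metric name -> score, stored with its story
    risks_accepted: list[str]
    decision: str
    owner: str
    review_date: date


# Illustrative values only; none of these come from a real evaluation.
record = DecisionRecord(
    context="Replace the summarization model used in the support tool.",
    options=["ship candidate", "hold and retrain", "ship behind a flag"],
    eval_evidence={"overall_accuracy": 0.91, "long_input_accuracy": 0.78},
    risks_accepted=["Long inputs score below the 0.85 target."],
    decision="ship behind a flag",
    owner="jane.doe",
    review_date=date(2025, 6, 1),
)
```

Because every field is required, a record cannot be constructed with a score but no owner or review date.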
Good
The good version includes slice results, representative failures, a comparison against thresholds, caveats, and a recommended action. It names who accepted the residual risk and when the decision should be revisited.
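The comparison of slice results against thresholds might look like the sketch below. The slice names, scores, and thresholds are invented for illustration, and `compare_slices` is a hypothetical helper rather than part of any library.

```python
def compare_slices(scores, thresholds, default_threshold):
    """Return one row per slice: (slice, score, threshold, passed)."""
    rows = []
    for name in sorted(scores):
        # A slice without its own threshold falls back to the default.
        threshold = thresholds.get(name, default_threshold)
        rows.append((name, scores[name], threshold, scores[name] >= threshold))
    return rows


# Illustrative numbers only.
scores = {"short_inputs": 0.94, "long_inputs": 0.78, "non_english": 0.86}
thresholds = {"long_inputs": 0.80}

rows = compare_slices(scores, thresholds, default_threshold=0.85)
failing = [name for name, _, _, passed in rows if not passed]

for name, score, threshold, passed in rows:
    status = "PASS" if passed else "FAIL"
    print(f"{name:<14} {score:.2f} vs {threshold:.2f}  {status}")
```

Here `long_inputs` fails its own 0.80 threshold while the other two slices pass the 0.85 default, which is the kind of detail an aggregate score would hide.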
Bad
The bad version says "evals look good" in a launch doc. It has no dataset version, no threshold, and no examples. Truly a monument to confidence over content.
Ugly
The ugly reality is political pressure: launches get pushed through with known gaps, and later nobody remembers who knew what. Decision records help teams separate "we did not know" from "we knew and accepted it." Those are very different incident conversations.
Artifact to produce
Write an eval decision record with: summary, versions, thresholds, pass/fail by slice, top failures, decision, owner, and follow-up date.
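One way to keep the artifact honest is a completeness check run before the record is accepted. This is a sketch under assumptions: the record is a plain dictionary, and the key names are hypothetical spellings of the items listed above.

```python
# Hypothetical key names, one per item in the artifact list.
REQUIRED_FIELDS = (
    "summary",
    "versions",
    "thresholds",
    "slice_results",
    "top_failures",
    "decision",
    "owner",
    "follow_up_date",
)


def missing_fields(record: dict) -> list[str]:
    """List required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# The "evals look good" launch doc, expressed as a record.
draft = {"summary": "evals look good", "decision": "launch"}
gaps = missing_fields(draft)
print(gaps)
```

Empty strings and empty lists count as missing here, so a placeholder owner of `""` does not slip through.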
Decision-record review
| Question | Why it matters |
|---|---|
| What decision did the report recommend? | Reports should produce action. |
| What risk was accepted explicitly? | Accepted risk should not become surprise risk. |
| When must the decision be revisited? | Eval evidence expires. |
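The three review questions can also be asked mechanically. The sketch below assumes a dictionary-shaped record with hypothetical keys `decision`, `accepted_risks`, and `review_date`; the example values are invented.

```python
from datetime import date


def review_record(record: dict, today: date) -> list[str]:
    """Return review findings; an empty list means all three questions are answered."""
    findings = []

    # Question 1: what decision did the report recommend?
    if not record.get("decision"):
        findings.append("No recommended decision.")

    # Question 2: was each accepted risk accepted by a named person?
    for risk in record.get("accepted_risks", []):
        if not risk.get("accepted_by"):
            description = risk.get("description", "<no description>")
            findings.append(f"Risk without a named accepter: {description}")

    # Question 3: when must the decision be revisited?
    review_date = record.get("review_date")
    if review_date is None:
        findings.append("No review date.")
    elif review_date <= today:
        findings.append(f"Review was due on {review_date.isoformat()}.")

    return findings


# Illustrative record: a decision exists, but the risk has no accepter
# and the review date has passed.
record = {
    "decision": "ship behind a flag",
    "accepted_risks": [
        {"description": "Long inputs below threshold", "accepted_by": ""}
    ],
    "review_date": date(2025, 6, 1),
}
findings = review_record(record, today=date(2025, 7, 1))
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the check reproducible when a record is re-examined later.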
Chapter takeaway
A good eval report is boring in the best way: versions, thresholds, failures, decision, owner. Almost like future readers matter.