The setup
Eval scope starts with risk, not tooling. A customer-support bot, a medical triage assistant, and a coding agent can all use rubrics, but the cost of a failure differs wildly between them. Scope tells you what deserves a golden set, what needs adversarial tests, and what can be monitored with lightweight telemetry.
Mental model
List the failure modes first: wrong answer, unsupported claim, unsafe action, privacy leak, tool misuse, refusal when help is expected, or cost blowout. Then rank them by severity and frequency. Your eval suite should mirror that ranking, with the most test effort going to the top of the list, not to whatever benchmark was easiest to download.
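The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `FailureMode` class, the 1-to-5 scales, the example scores, and the choice to sort by severity first with frequency as a tie-breaker are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int   # hypothetical scale: 1 (cosmetic) to 5 (severe harm)
    frequency: int  # hypothetical scale: 1 (rare) to 5 (common)


def rank(modes):
    """Order failure modes by severity, then by frequency, highest first."""
    return sorted(modes, key=lambda m: (m.severity, m.frequency), reverse=True)


# Illustrative scores only; real values come from your own risk review.
modes = [
    FailureMode("wrong answer", severity=3, frequency=4),
    FailureMode("privacy leak", severity=5, frequency=1),
    FailureMode("unsafe action", severity=5, frequency=2),
    FailureMode("cost blowout", severity=2, frequency=3),
]

ranked = rank(modes)
for position, mode in enumerate(ranked, start=1):
    print(f"{position}. {mode.name} (severity={mode.severity}, frequency={mode.frequency})")
```

Sorting on the pair rather than on a severity-times-frequency product keeps a rare but severe failure from being outranked by a frequent cosmetic one. A team that prefers a weighted score could swap in a different sort key.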
Good
The good version has a risk register. High-severity failures get explicit test cases, human review, and blocking thresholds. Medium-risk failures get regression coverage and slice reporting. Low-risk cosmetic issues are tracked but do not paralyze deployment.
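One way to make the tiers above concrete is a small policy table plus a launch gate. The sketch below assumes three tier names, a pass-rate metric, and made-up thresholds; `POLICY`, `launch_blockers`, and the result fields are hypothetical names chosen for illustration.

```python
# Hypothetical mapping from risk tier to the treatment described above.
POLICY = {
    "high": {"methods": ["explicit test cases", "human review"], "blocking": True},
    "medium": {"methods": ["regression coverage", "slice reporting"], "blocking": False},
    "low": {"methods": ["tracking"], "blocking": False},
}


def launch_blockers(results):
    """Return the failure modes whose eval results should block launch.

    Only tiers marked as blocking can stop a launch; other tiers are
    reported elsewhere but never gate deployment.
    """
    return [
        r["name"]
        for r in results
        if POLICY[r["tier"]]["blocking"] and r["pass_rate"] < r["threshold"]
    ]


# Illustrative numbers only.
results = [
    {"name": "privacy leak", "tier": "high", "pass_rate": 0.97, "threshold": 0.99},
    {"name": "wrong answer", "tier": "medium", "pass_rate": 0.80, "threshold": 0.90},
    {"name": "typo style", "tier": "low", "pass_rate": 0.50, "threshold": 0.95},
]

blockers = launch_blockers(results)
print(blockers)  # ['privacy leak']
```

Here the medium- and low-tier misses are still visible in the results, but only the high-severity miss stops the release.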
Bad
The bad version tries to "evaluate everything" and ends up measuring almost nothing well. It spends equal effort on typo style and privacy leakage, because apparently spreadsheets are where prioritization goes to nap.
Ugly
The ugly reality is organizational: legal, product, support, and engineering may disagree on what counts as unacceptable. That disagreement should be captured before launch, not rediscovered during incident review.
Artifact to produce
Create a risk-to-eval matrix with one row per failure mode and a column for each of: severity, expected frequency, eval method, owner, threshold, and escalation path.
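The matrix can live in a spreadsheet, but keeping it as structured data makes gaps easy to spot. The following is a sketch under stated assumptions: the `MatrixRow` class, the helper names, and every example value (owners, thresholds, escalation paths) are invented for illustration; only the seven fields come from the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class MatrixRow:
    failure_mode: str
    severity: str
    expected_frequency: str
    eval_method: str
    owner: str
    threshold: str
    escalation_path: str


def missing_fields(row):
    """Return the names of fields left blank in a row."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


def to_markdown(rows):
    """Render rows as a pipe table with one column per matrix field."""
    names = [f.name for f in fields(MatrixRow)]
    header = "| " + " | ".join(n.replace("_", " ").capitalize() for n in names) + " |"
    separator = "|" + "---|" * len(names)
    body = ["| " + " | ".join(getattr(row, n) for n in names) + " |" for row in rows]
    return "\n".join([header, separator, *body])


# Placeholder entries; the second row deliberately has no owner.
rows = [
    MatrixRow("privacy leak", "high", "rare", "adversarial test set",
              "security lead", "zero leaks in test set", "page on-call, halt rollout"),
    MatrixRow("wrong answer", "medium", "common", "golden set regression",
              "", "90% pass rate", "file ticket"),
]

for row in rows:
    gaps = missing_fields(row)
    if gaps:
        print(f"{row.failure_mode}: missing {', '.join(gaps)}")

print(to_markdown(rows))
```

A blank owner or escalation path is worth flagging before launch, since a row nobody owns gives the appearance of coverage without the substance.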
Risk-scope review
| Question | Why it matters |
|---|---|
| Which failure mode has the highest severity? | Severity drives depth of testing. |
| Which risk is intentionally out of scope? | Scope honesty prevents fake coverage. |
| Who can veto launch for this risk? | Risk without authority is theater. |
Chapter takeaway
The fastest way to under-test a dangerous feature is to start with tooling instead of failure modes. Naturally, tooling has better logos.