Product Risk And Eval Scope

The setup

Eval scope starts with risk, not tooling. A customer-support bot, a medical triage assistant, and a coding agent can all use rubrics, but the cost of a failure differs wildly between them. Scope tells you what deserves a golden set, what needs adversarial tests, and what can be monitored with lightweight telemetry.

Picture this

[Figure: good, bad, and ugly paths for risk-scoped evaluation.]

Mental model

List the failure modes first: wrong answer, unsupported claim, unsafe action, privacy leak, tool misuse, refusal when help is expected, or cost blowout. Then rank them by severity and frequency. Your eval suite should look like that ranking, not like whatever benchmark was easiest to download.
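
The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the 1 to 5 scales, the severity-times-frequency score, and every number below are placeholder assumptions chosen only to make the example run. Real values should come from your own incident data and domain judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int   # assumed scale: 1 (cosmetic) to 5 (catastrophic)
    frequency: int  # assumed scale: 1 (rare) to 5 (constant)

    @property
    def risk_score(self) -> int:
        # Assumption: a simple product; other scoring schemes are equally valid.
        return self.severity * self.frequency


def rank_failure_modes(modes: List[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest risk score.

    Ties are broken by severity, so the more severe mode is listed first.
    """
    return sorted(modes, key=lambda m: (m.risk_score, m.severity), reverse=True)


# Placeholder numbers for the failure modes listed above.
FAILURE_MODES = [
    FailureMode("wrong answer", severity=3, frequency=4),
    FailureMode("unsupported claim", severity=3, frequency=3),
    FailureMode("unsafe action", severity=5, frequency=2),
    FailureMode("privacy leak", severity=5, frequency=1),
    FailureMode("tool misuse", severity=4, frequency=2),
    FailureMode("refusal when help is expected", severity=2, frequency=3),
    FailureMode("cost blowout", severity=2, frequency=2),
]

if __name__ == "__main__":
    for mode in rank_failure_modes(FAILURE_MODES):
        print(f"{mode.risk_score:>2}  {mode.name}")
```

Note that a plain product lets a rare but catastrophic failure such as a privacy leak sink toward the bottom of the list. If that ordering feels wrong for your product, weight severity more heavily or review high-severity items separately from the score.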

Good

The good version has a risk register. High-severity failures get explicit test cases, human review, and blocking thresholds. Medium-risk failures get regression coverage and slice reporting. Low-risk cosmetic issues are tracked but do not paralyze deployment.
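
One way to picture such a register is as a small data structure with a launch gate on top. The sketch below is hypothetical: the names `Tier`, `RiskEntry`, and `launch_blockers`, the pass-rate metric, and the thresholds are all assumptions made for illustration. Only the tier-to-coverage mapping comes from the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Tier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# What each tier is owed, taken from the paragraph above.
REQUIRED_COVERAGE = {
    Tier.HIGH: ("explicit test cases", "human review", "blocking threshold"),
    Tier.MEDIUM: ("regression coverage", "slice reporting"),
    Tier.LOW: ("tracking",),
}


@dataclass(frozen=True)
class RiskEntry:
    failure_mode: str
    tier: Tier
    pass_rate: float                            # assumed metric, 0.0 to 1.0
    blocking_threshold: Optional[float] = None  # assumed: set for HIGH only


def launch_blockers(register: List[RiskEntry]) -> List[str]:
    """Return one message per high-severity risk that should block launch."""
    blockers = []
    for entry in register:
        if entry.tier is not Tier.HIGH:
            continue  # medium and low risks are reported, not gated
        if entry.blocking_threshold is None:
            blockers.append(f"{entry.failure_mode}: no blocking threshold set")
        elif entry.pass_rate < entry.blocking_threshold:
            blockers.append(
                f"{entry.failure_mode}: "
                f"{entry.pass_rate:.2f} < {entry.blocking_threshold:.2f}"
            )
    return blockers


# Placeholder register; the numbers are illustrative only.
REGISTER = [
    RiskEntry("unsafe action", Tier.HIGH, pass_rate=0.97, blocking_threshold=0.99),
    RiskEntry("wrong answer", Tier.MEDIUM, pass_rate=0.80),
    RiskEntry("awkward phrasing", Tier.LOW, pass_rate=0.60),
]

if __name__ == "__main__":
    print(launch_blockers(REGISTER))
```

The design choice worth noticing is that only the high tier can stop a launch. The low-tier entry has the worst pass rate in the register and still does not appear among the blockers, which is the "tracked but do not paralyze deployment" behavior described above.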

Question	Why it matters
Which failure mode has the highest severity?	Severity drives depth of testing.
Which risk is intentionally out of scope?	Scope honesty prevents fake coverage.
Who can veto launch for this risk?	Risk without authority is theater.
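
The three questions can also be turned into a simple completeness check on the register. The sketch below is one possible shape, under assumptions: the field names, the 1 to 5 severity scale, and the example roles are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScopedRisk:
    failure_mode: str
    severity: int                              # assumed scale: 1 to 5
    in_scope: bool = True
    out_of_scope_reason: Optional[str] = None  # why it is deliberately excluded
    veto_owner: Optional[str] = None           # role that can block launch


def highest_severity(risks: List[ScopedRisk]) -> ScopedRisk:
    """Question 1: which failure mode has the highest severity?"""
    return max(risks, key=lambda r: r.severity)


def scope_gaps(risks: List[ScopedRisk]) -> List[str]:
    """Questions 2 and 3: flag silent exclusions and risks nobody owns."""
    gaps = []
    for risk in risks:
        if not risk.in_scope and not risk.out_of_scope_reason:
            gaps.append(f"{risk.failure_mode}: out of scope without a stated reason")
        if risk.in_scope and not risk.veto_owner:
            gaps.append(f"{risk.failure_mode}: no one can veto launch")
    return gaps


# Placeholder entries; roles and severities are illustrative only.
RISKS = [
    ScopedRisk("privacy leak", severity=5, veto_owner="privacy lead"),
    ScopedRisk("tool misuse", severity=4),
    ScopedRisk("cost blowout", severity=2, in_scope=False),
]

if __name__ == "__main__":
    print(highest_severity(RISKS).failure_mode)
    for gap in scope_gaps(RISKS):
        print(gap)
```

An empty result from `scope_gaps` does not mean the scope is right. It only means every exclusion is written down and every in-scope risk has a named owner.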

Chapter takeaway

The fastest way to under-test a dangerous feature is to start with tooling instead of failure modes. Naturally, tooling has better logos.

References

Anthropic evaluation guide
