The setup
Safety evals test whether a system respects product and policy boundaries under realistic and adversarial conditions. They are not a single "toxicity" score. They cover privacy, harmful instructions, unsafe tool use, over-refusal, and policy-specific edge cases.
Picture this
Mental model
Use threat models. What can the model say, reveal, trigger, or enable? Which failures are annoying, harmful, illegal, or irreversible? Each severity tier deserves different eval depth and rollout constraints.
Good
The good version turns policy text into labeled examples, includes benign requests that should not be refused, and tracks both false allows and false blocks. It uses red-team examples as stress tests, not as the entire dataset.
Bad
The bad version runs a generic toxicity classifier and declares safety solved. It misses privacy leakage, unsafe tool execution, and over-refusal. But hey, the dashboard has a shield icon, so civilization is saved.
Ugly
The ugly reality is that new abuse patterns arrive after launch. Safety evals need monitoring, user-report loops, and a way to promote incidents into regression tests quickly.
Artifact to produce
Write a policy eval matrix: policy rule, allowed examples, disallowed examples, severity, expected behavior, and escalation owner.
Policy review
| Question | Why it matters |
|---|---|
| Which policy rule maps to each example? | Policy text needs executable cases. |
| Are over-refusals tested? | Safety also means helping legitimate users. |
| How do incidents enter regression coverage? | New abuse patterns should harden the suite. |
Chapter takeaway
Safety evals do not prove the absence of harm. They prove you looked for named harms with enough discipline to learn something.