The setup
A metric is useful only if it predicts or protects something users care about. Exact match, pass rate, helpfulness, faithfulness, escalation rate, and time-to-resolution all answer different questions. Treating them as interchangeable is how a model gets "better" while customers get angrier.
Mental model
Start with the user journey. What does success look like at the end of the task? Then choose proxy metrics that track that success closely and that a team can act on. For support, that might mean correct resolution, safe escalation, citation fidelity, and no needless loopbacks.
Good
The good version reports metrics by intent, customer segment, language, risk tier, and tool path. It shows both pass rate and representative failures. A reviewer can see whether improvement came from real quality or from easier traffic.
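One way to check whether a gain came from real quality or from easier traffic is to recompute the new pass rate under the old traffic mix. The sketch below is illustrative only: `SliceResult`, the slice names, and all counts are hypothetical, and it assumes both runs report the same slices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    """Pass/fail counts for one slice of traffic (hypothetical shape)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def aggregate_pass_rate(results: dict[str, SliceResult]) -> float:
    """Overall pass rate, weighted by whatever traffic mix the run happened to see."""
    total = sum(r.total for r in results.values())
    passed = sum(r.passed for r in results.values())
    return passed / total if total else 0.0


def mix_adjusted_pass_rate(
    new: dict[str, SliceResult], baseline: dict[str, SliceResult]
) -> float:
    """New per-slice pass rates reweighted to the baseline traffic mix.

    Assumes both runs report the same slice names.
    """
    base_total = sum(r.total for r in baseline.values())
    if base_total == 0:
        return 0.0
    weighted = sum(new[name].pass_rate * r.total for name, r in baseline.items())
    return weighted / base_total


# Made-up counts: per-slice quality is unchanged, but the new run saw more easy traffic.
baseline = {"easy_faq": SliceResult(80, 100), "refunds": SliceResult(50, 100)}
new_run = {"easy_faq": SliceResult(240, 300), "refunds": SliceResult(50, 100)}

raw_gain = aggregate_pass_rate(new_run) - aggregate_pass_rate(baseline)
adjusted_gain = mix_adjusted_pass_rate(new_run, baseline) - aggregate_pass_rate(baseline)
print(f"raw gain: {raw_gain:+.3f}, mix-adjusted gain: {adjusted_gain:+.3f}")
```

If the raw gain is positive but the mix-adjusted gain is near zero, the improvement likely reflects a shift toward easier traffic rather than better answers.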
Bad
The bad version optimizes one aggregate score. It hides that enterprise billing improved while refunds collapsed. But the line chart went up, so naturally everyone clapped for the chart.
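The failure mode above can be caught mechanically. Here is a minimal sketch, assuming results arrive as `(passed, total)` pairs per slice; the function names, the 0.02 tolerance, and the counts are all made up for illustration.

```python
def pass_rate(passed: int, total: int) -> float:
    return passed / total if total else 0.0


def overall(results: dict[str, tuple[int, int]]) -> float:
    """Aggregate pass rate across all slices."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return pass_rate(passed, total)


def hidden_regressions(
    baseline: dict[str, tuple[int, int]],
    candidate: dict[str, tuple[int, int]],
    tolerance: float = 0.02,  # hypothetical noise allowance, not a recommended value
) -> dict[str, float]:
    """Return {slice: delta} for slices whose pass rate dropped by more than tolerance."""
    regressions = {}
    for name, (b_pass, b_total) in baseline.items():
        if name not in candidate:
            continue  # slices missing from the candidate need separate handling
        c_pass, c_total = candidate[name]
        delta = pass_rate(c_pass, c_total) - pass_rate(b_pass, b_total)
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Made-up counts: the aggregate rises while refunds collapse.
baseline = {"enterprise_billing": (60, 100), "refunds": (70, 100)}
candidate = {"enterprise_billing": (90, 100), "refunds": (45, 100)}

regressions = hidden_regressions(baseline, candidate)
if overall(candidate) > overall(baseline) and regressions:
    print("aggregate improved, but these slices regressed:", sorted(regressions))
```

A check like this belongs next to the line chart, so the chart cannot be applauded without the slice list.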
Ugly
The ugly reality is proxy drift. A thumbs-up rate can be gamed by shorter answers, and low escalation can mean users gave up. Online signals need interpretation alongside traces and sampled review.
Artifact to produce
Define a metric map. For each metric, record the outcome it stands for, the proxy metric itself, its known blind spot, the slices it must be reported by, and the threshold that triggers a decision.
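A metric map can live in a spreadsheet, but a small typed record makes the five fields hard to skip. The sketch below is one possible shape, not a prescribed schema: `MetricMapEntry`, its `review` method, the slice names, and the 0.85 threshold are hypothetical, and it assumes the decision threshold is a minimum that every required slice must meet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricMapEntry:
    outcome: str
    proxy_metric: str
    known_blind_spot: str
    required_slices: tuple[str, ...]
    decision_threshold: float  # assumed: minimum acceptable value in every required slice

    def review(self, slice_values: dict[str, float]) -> list[str]:
        """Return the problems that block a decision; an empty list means the entry passes."""
        problems = []
        for name in self.required_slices:
            if name not in slice_values:
                problems.append(f"missing slice: {name}")
            elif slice_values[name] < self.decision_threshold:
                problems.append(f"below threshold: {name}")
        return problems


entry = MetricMapEntry(
    outcome="user got a helpful answer",
    proxy_metric="thumbs-up rate",
    known_blind_spot="can be gamed by shorter answers",
    required_slices=("enterprise_billing", "refunds"),  # hypothetical slice names
    decision_threshold=0.85,  # hypothetical value
)

# Made-up slice values for illustration.
problems = entry.review({"enterprise_billing": 0.93, "refunds": 0.71})
print(problems)
```

Treating a missing slice as a problem, rather than silently skipping it, keeps an unreported slice from passing by default.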
Metric review
| Question | Why it matters |
|---|---|
| What user outcome does this metric proxy? | Metrics detached from outcomes optimize the wrong thing. |
| Which slice could regress while the average improves? | Aggregate quality is where bad slices hide. |
| What online signal would confirm or challenge this offline metric? | Offline and online evidence should check each other; when they disagree, the gap shows where to investigate. |
Chapter takeaway
If nobody can explain why a metric matters to a user, do not put it on the launch slide. The slide has suffered enough.