Task Metrics And User Outcomes

The setup

A metric is useful only if it predicts or protects something users care about. Exact match, pass rate, helpfulness, faithfulness, escalation rate, and time-to-resolution all answer different questions. Treating them as interchangeable is how a model gets "better" while customers get angrier.

Picture this

Good, bad, and ugly paths for outcome-aligned measurement.

Mental model

Start with the user journey. What does success look like at the end of the task? Then choose proxy metrics that are close enough to that success to be actionable. For support, that might mean correct resolution, safe escalation, citation fidelity, and no needless loopbacks.
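As a rough sketch of what those support proxies could look like in code, assume each reviewed conversation is stored as a record with the fields below. Every name here (`Conversation`, `support_metrics`, and all the fields) is a hypothetical placeholder, not part of any real library, and the definitions are one reasonable reading of "correct resolution, safe escalation, citation fidelity, and no needless loopbacks", not the only one.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # All fields are hypothetical labels from human or automated review.
    resolved_correctly: bool   # the user's issue was actually fixed
    needed_escalation: bool    # a human should have been brought in
    escalated: bool            # the assistant handed off to a human
    citations_supported: int   # cited claims backed by the source
    citations_total: int       # all cited claims
    loopbacks: int             # times the user had to repeat or re-ask


def support_metrics(convs: list[Conversation]) -> dict[str, float]:
    """Compute four proxy metrics that stay close to the user's outcome."""
    if not convs:
        raise ValueError("need at least one conversation")
    n = len(convs)
    needing = [c for c in convs if c.needed_escalation]
    cited = sum(c.citations_total for c in convs)
    return {
        "correct_resolution": sum(c.resolved_correctly for c in convs) / n,
        # Safe escalation: of the cases that needed a human, how many got one.
        "safe_escalation": (
            sum(c.escalated for c in needing) / len(needing) if needing else 1.0
        ),
        # Citation fidelity: share of cited claims the source really supports.
        "citation_fidelity": (
            sum(c.citations_supported for c in convs) / cited if cited else 1.0
        ),
        # Share of conversations where the user never had to repeat themselves.
        "no_loopback_rate": sum(c.loopbacks == 0 for c in convs) / n,
    }


# Toy records, invented purely for illustration.
convs = [
    Conversation(True, False, False, 2, 2, 0),
    Conversation(False, True, True, 1, 2, 1),
    Conversation(True, True, False, 0, 0, 0),
    Conversation(True, False, False, 3, 4, 2),
]
metrics = support_metrics(convs)
```

Each number maps to a sentence a user could say ("my problem got fixed", "I got a human when I needed one"), which is the test for whether a proxy is close enough to the outcome to act on.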

Question	Why it matters
What user outcome does this metric proxy?	Metrics detached from outcomes optimize the wrong thing.
Which slice could regress while the average improves?	Aggregate quality is where bad slices hide.
What online signal would confirm or challenge this offline metric?	An offline gain that never shows up online means the proxy is off. The disagreement should tell you what to fix.
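The slice question in the table is easy to check mechanically. The sketch below assumes eval results are available as rows of `(slice, passed_baseline, passed_candidate)`. The row shape, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def pass_rates(rows):
    """Return (overall, per_slice); each value is (baseline, candidate)."""
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one eval row")
    # Per slice: [row count, baseline passes, candidate passes].
    counts = defaultdict(lambda: [0, 0, 0])
    for slice_name, base_ok, cand_ok in rows:
        c = counts[slice_name]
        c[0] += 1
        c[1] += bool(base_ok)
        c[2] += bool(cand_ok)
    per_slice = {s: (b / n, c / n) for s, (n, b, c) in counts.items()}
    total = len(rows)
    overall = (
        sum(b for _, b, _ in counts.values()) / total,
        sum(c for _, _, c in counts.values()) / total,
    )
    return overall, per_slice


def regressed_slices(rows, tolerance=0.0):
    """List slices whose candidate pass rate fell by more than tolerance."""
    _, per_slice = pass_rates(rows)
    return sorted(s for s, (b, c) in per_slice.items() if b - c > tolerance)


# Toy data, invented for illustration: a large slice improves,
# a small slice gets worse.
rows = (
    [("billing", i < 4, i < 7) for i in range(8)]
    + [("refunds", True, i < 1) for i in range(2)]
)
overall, per_slice = pass_rates(rows)
flagged = regressed_slices(rows)
```

Here the overall pass rate rises from 0.6 to 0.8 while `refunds` falls from 1.0 to 0.5, so the average looks like a win and the flagged list says otherwise. With only two rows in the regressed slice, a real review would also want a minimum slice size or a confidence interval before calling it a regression.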

Chapter takeaway

If nobody can explain why a metric matters to a user, do not put it on the launch slide. The slide has suffered enough.

