MCP Mastery
Chapter 3
warmup
~35 min

Task Metrics And User Outcomes

Tie scores to user value before the dashboard becomes theater.

Eval Toolkit: 2026.05
Observability: trace-first
Python: 3.11
Reviewed: 2026-05-17

This chapter helps you avoid six common eval-writing mistakes.

The setup

A metric is useful only if it predicts or protects something users care about. Exact match, pass rate, helpfulness, faithfulness, escalation rate, and time-to-resolution all answer different questions. Treating them as interchangeable is how a model gets "better" while customers get angrier.

Picture this

Figure: good, bad, and ugly paths for outcome-aligned measurement.

Mental model

Start with the user journey. What does success look like at the end of the task? Then choose proxy metrics that are close enough to that success to be actionable. For support, that might mean correct resolution, safe escalation, citation fidelity, and no needless loopbacks.

Good

The good version reports metrics by intent, customer segment, language, risk tier, and tool path. It shows both pass rate and representative failures. A reviewer can see whether improvement came from real quality or from easier traffic.
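
A sliced report like the one described above can be sketched in a few lines. This is a minimal illustration, not the Eval Toolkit's API: the record fields (`intent`, `risk_tier`, `passed`) and the helper `pass_rate_by_slice` are hypothetical names chosen for the example. It returns the pass rate, the sample size, and the failing records for each slice, so a reviewer can read the failures next to the number.

```python
from collections import defaultdict

# Hypothetical eval records; the field names are illustrative assumptions.
RECORDS = [
    {"intent": "billing", "risk_tier": "low", "passed": True},
    {"intent": "billing", "risk_tier": "low", "passed": True},
    {"intent": "refund", "risk_tier": "high", "passed": False},
    {"intent": "refund", "risk_tier": "high", "passed": True},
    {"intent": "refund", "risk_tier": "high", "passed": False},
]


def pass_rate_by_slice(records, keys):
    """Group records by `keys`; return {slice: (pass_rate, n, failures)}."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)

    report = {}
    for slice_key, recs in groups.items():
        failures = [r for r in recs if not r["passed"]]
        pass_rate = 1 - len(failures) / len(recs)
        report[slice_key] = (pass_rate, len(recs), failures)
    return report


report = pass_rate_by_slice(RECORDS, keys=("intent", "risk_tier"))
for slice_key, (rate, n, failures) in sorted(report.items()):
    print(slice_key, f"pass_rate={rate:.2f}", f"n={n}", f"failures={len(failures)}")
```

Reporting `n` beside each rate matters: a slice that looks better on a handful of easy cases is a traffic change, not a quality change.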

Bad

The bad version optimizes one aggregate score. It hides that enterprise billing improved while refunds collapsed. But the line chart went up, so naturally everyone clapped for the chart.
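
To make the failure concrete, here is a small worked example with invented numbers (they are not measurements from any real system). The slice names and the `slice_regressions` helper are hypothetical. The aggregate pass rate rises from 0.74 to 0.77 while the refunds slice falls from 0.80 to 0.50.

```python
# Invented counts for illustration: slice -> (passed, total).
BEFORE = {"enterprise_billing": (420, 600), "refunds": (320, 400)}
AFTER = {"enterprise_billing": (570, 600), "refunds": (200, 400)}


def aggregate(run):
    """Overall pass rate across all slices."""
    passed = sum(p for p, _ in run.values())
    total = sum(n for _, n in run.values())
    return passed / total


def slice_regressions(before, after, tolerance=0.02):
    """Return {slice: delta} for slices whose pass rate dropped by more than `tolerance`."""
    regressions = {}
    for name, (p0, n0) in before.items():
        p1, n1 = after[name]
        delta = p1 / n1 - p0 / n0
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


print(f"aggregate before={aggregate(BEFORE):.2f} after={aggregate(AFTER):.2f}")
# prints: aggregate before=0.74 after=0.77
for name, delta in slice_regressions(BEFORE, AFTER).items():
    print(f"regressed: {name} ({delta:+.2f})")
# prints: regressed: refunds (-0.30)
```

The 0.02 tolerance is an arbitrary placeholder. In practice the threshold would come from the metric map for that slice.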

Ugly

The ugly reality is proxy drift. A thumbs-up rate can be gamed by shorter answers, and low escalation can mean users gave up. Online signals need interpretation alongside traces and sampled review.
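
One way to catch the two drift patterns named above is to pair each online signal with a counter-signal and flag the suspicious combinations for trace review. The sketch below assumes you already collect the four weekly numbers shown. The `WeeklySignals` fields and the `drift_flags` helper are hypothetical, and a flag is a prompt to sample traces, not a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklySignals:
    """Hypothetical weekly online signals; the field names are assumptions."""
    thumbs_up_rate: float
    median_answer_tokens: int
    escalation_rate: float
    abandonment_rate: float  # sessions that ended with no resolution and no escalation


def drift_flags(prev: WeeklySignals, curr: WeeklySignals) -> list[str]:
    """Flag proxy movements that look good but have a suspicious counter-signal."""
    flags = []
    if (curr.thumbs_up_rate > prev.thumbs_up_rate
            and curr.median_answer_tokens < prev.median_answer_tokens):
        flags.append("thumbs-up rose while answers got shorter: sample traces for review")
    if (curr.escalation_rate < prev.escalation_rate
            and curr.abandonment_rate > prev.abandonment_rate):
        flags.append("escalation fell while abandonment rose: users may be giving up")
    return flags


last_week = WeeklySignals(0.71, 180, 0.12, 0.08)
this_week = WeeklySignals(0.78, 95, 0.07, 0.15)
for flag in drift_flags(last_week, this_week):
    print(flag)
```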

Artifact to produce

Define a metric map: outcome, proxy metric, known blind spot, required slice, and decision threshold.
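
One possible shape for a metric map entry, sketched as a Python dataclass. The five fields mirror the list above. The class name, the `decide` method, and the example values are hypothetical, and the rule that every required slice must clear the threshold is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricMapEntry:
    outcome: str                      # what the user actually needs
    proxy_metric: str                 # what we can measure
    known_blind_spot: str             # how the proxy can mislead
    required_slices: tuple[str, ...]  # slices that must be reported
    decision_threshold: float         # assumed: minimum value in every required slice

    def decide(self, slice_values: dict[str, float]) -> str:
        """Return a launch decision from per-slice values of the proxy metric."""
        missing = [s for s in self.required_slices if s not in slice_values]
        if missing:
            return f"blocked: missing slices {missing}"
        failing = [s for s in self.required_slices
                   if slice_values[s] < self.decision_threshold]
        if failing:
            return f"hold: below threshold in {failing}"
        return "ship"


entry = MetricMapEntry(
    outcome="Customer's issue is resolved correctly",
    proxy_metric="correct_resolution_rate",
    known_blind_spot="Graders may miss answers that are correct but unsafe to act on",
    required_slices=("refund", "billing"),
    decision_threshold=0.95,
)
print(entry.decide({"refund": 0.97, "billing": 0.91}))
# prints: hold: below threshold in ['billing']
```

Refusing to decide when a required slice is missing keeps an unmeasured slice from passing by default.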

Metric review

Question | Why it matters
What user outcome does this metric proxy? | Metrics detached from outcomes optimize the wrong thing.
Which slice could regress while the average improves? | Aggregate quality is where bad slices hide.
What online signal would confirm or challenge this offline metric? | Offline and online evidence should disagree productively.

Chapter takeaway

If nobody can explain why a metric matters to a user, do not put it on the launch slide. The slide has suffered enough.

Quiz

  1. Why should metrics be sliced by intent or risk tier?

  2. Which is the bad version of outcome-aligned measurement?

  3. What should the ugly reality change about your process?