MCP Mastery
Chapter 11
boss
~40 min

Online Signals And Feedback

Read production signals without mistaking clicks for quality.

Eval Toolkit 2026.05
Observability trace-first
Python 3.11
Reviewed 2026-05-17

Reading this chapter helps prevent six common eval-writing mistakes.

The setup

Online signals are tempting because they are real. They are also noisy, biased, delayed, and easy to misread. A thumbs-up can mean "correct," "fast," "funny," or "I stopped caring."

Picture this

Good, bad, and ugly paths for production feedback evaluation.

Mental model

Classify signals by meaning: explicit rating, user correction, escalation, re-open, dwell time, abandonment, complaint, and downstream business event. Then decide which signals are health indicators and which are investigation triggers.
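
The classification above can be written down as data so that it is reviewed like any other config. The sketch below is illustrative only: the `Signal` and `Role` names and the specific assignment of each signal to a role are assumptions, not something this chapter prescribes, and your product may reasonably split them differently.

```python
from enum import Enum


class Signal(Enum):
    """The signal kinds listed in the mental model."""
    EXPLICIT_RATING = "explicit_rating"
    USER_CORRECTION = "user_correction"
    ESCALATION = "escalation"
    REOPEN = "re_open"
    DWELL_TIME = "dwell_time"
    ABANDONMENT = "abandonment"
    COMPLAINT = "complaint"
    BUSINESS_EVENT = "downstream_business_event"


class Role(Enum):
    HEALTH_INDICATOR = "health_indicator"            # watched as an aggregate trend
    INVESTIGATION_TRIGGER = "investigation_trigger"  # each event prompts a trace review


# Assumed assignment, for illustration only.
SIGNAL_ROLES: dict[Signal, Role] = {
    Signal.EXPLICIT_RATING: Role.HEALTH_INDICATOR,
    Signal.DWELL_TIME: Role.HEALTH_INDICATOR,
    Signal.ABANDONMENT: Role.HEALTH_INDICATOR,
    Signal.BUSINESS_EVENT: Role.HEALTH_INDICATOR,
    Signal.USER_CORRECTION: Role.INVESTIGATION_TRIGGER,
    Signal.ESCALATION: Role.INVESTIGATION_TRIGGER,
    Signal.REOPEN: Role.INVESTIGATION_TRIGGER,
    Signal.COMPLAINT: Role.INVESTIGATION_TRIGGER,
}


def role_of(signal: Signal) -> Role:
    """Look up how a signal should be used."""
    return SIGNAL_ROLES[signal]
```

Keeping the mapping explicit forces a decision for every signal; an unmapped signal raises `KeyError` instead of silently landing on a dashboard.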

Good

The good version samples production traces, links feedback to task type and model version, and reviews failures by slice. It treats online signals as complements to offline suites, not replacements.
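
Reviewing failures by slice might look like the following minimal sketch. It assumes each sampled record is a plain dict with hypothetical `task_type`, `model_version`, and `failed` keys; a real pipeline would read these from your trace store.

```python
from collections import defaultdict


def failure_rate_by_slice(records):
    """Return the failure rate for each (task_type, model_version) slice."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        key = (record["task_type"], record["model_version"])
        totals[key] += 1
        if record["failed"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Hypothetical sampled records, for illustration only.
records = [
    {"task_type": "summarize", "model_version": "v1", "failed": False},
    {"task_type": "summarize", "model_version": "v1", "failed": True},
    {"task_type": "extract", "model_version": "v2", "failed": False},
]
rates = failure_rate_by_slice(records)
```

A single global failure rate would hide the fact that one task type on one model version accounts for the failures here; the slice key is what makes the number actionable.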

Bad

The bad version optimizes for engagement because the graph updates quickly. The model becomes chatty, users click more, and task completion quietly worsens. Congratulations, you built social media in miniature.
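
One way to catch this failure mode is a guardrail that compares engagement against task completion before a change ships. The sketch below is an assumption-laden illustration: the metric names `clicks_per_session` and `task_completion_rate` and the `tolerance` default are hypothetical, not part of any toolkit named in this chapter.

```python
def engagement_masks_regression(baseline, candidate, tolerance=0.01):
    """Flag a candidate whose engagement rose while task completion fell.

    `baseline` and `candidate` are dicts of aggregate metrics.
    `tolerance` is the completion drop treated as noise (assumed value).
    """
    engagement_up = (
        candidate["clicks_per_session"] > baseline["clicks_per_session"]
    )
    completion_down = (
        candidate["task_completion_rate"]
        < baseline["task_completion_rate"] - tolerance
    )
    return engagement_up and completion_down


# Hypothetical numbers, for illustration only.
baseline = {"clicks_per_session": 3.0, "task_completion_rate": 0.80}
chatty = {"clicks_per_session": 4.2, "task_completion_rate": 0.74}
flagged = engagement_masks_regression(baseline, chatty)
```

The check is deliberately crude. Its job is to stop a rising engagement graph from being read as a win when the outcome metric moved the other way.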

Ugly

The ugly reality is attribution. A bad outcome might come from retrieval, UI wording, user intent mismatch, or model behavior. Store enough trace context to investigate instead of arguing from one metric.
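
Once reviewers have looked at the trace behind each bad outcome, the suspected causes can be tallied rather than argued about. This is a minimal sketch assuming each reviewed item is a dict with a hypothetical `suspected_cause` field; the cause labels mirror the four sources named above.

```python
from collections import Counter

# Labels taken from the causes listed above; the field name is an assumption.
CAUSES = ("retrieval", "ui_wording", "intent_mismatch", "model_behavior")


def tally_causes(reviewed):
    """Count reviewed bad outcomes by suspected cause.

    Items with a missing or unknown cause are counted as "unattributed"
    so that gaps in trace context stay visible.
    """
    counts = Counter()
    for item in reviewed:
        cause = item.get("suspected_cause")
        counts[cause if cause in CAUSES else "unattributed"] += 1
    return counts


# Hypothetical review results, for illustration only.
reviewed = [
    {"trace_id": "t-1", "suspected_cause": "retrieval"},
    {"trace_id": "t-2", "suspected_cause": "retrieval"},
    {"trace_id": "t-3", "suspected_cause": "model_behavior"},
    {"trace_id": "t-4"},
]
cause_counts = tally_causes(reviewed)
```

A large "unattributed" count is itself a finding: it suggests the stored trace context is too thin to investigate.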

Artifact to produce

Build a feedback schema: task, model version, trace id, explicit rating, implicit event, user segment, and review status.

Online signal review

Question | Why it matters
What does each signal actually mean? | Clicks and thumbs are ambiguous.
Which trace fields explain the signal? | Attribution needs context.
How are sampled failures reviewed? | Raw telemetry needs interpretation.

Chapter takeaway

Production feedback is real, but not automatically wise. Users generate evidence, not neatly labeled training scripture.

Quiz

  1. Why are online signals not enough by themselves?

  2. Which is the bad version of production feedback evaluation?

  3. What should the ugly reality change about your process?