Online Signals And Feedback

The setup

Online signals are tempting because they are real. They are also noisy, biased, delayed, and easy to misread. A thumbs-up can mean "correct," "fast," "funny," or "I stopped caring."

Picture this

Figure: Good, bad, and ugly paths for production feedback evaluation.

Classify signals by meaning: explicit rating, user correction, escalation, re-open, dwell time, abandonment, complaint, and downstream business event. Then decide which signals are health indicators and which are investigation triggers.
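One way to make that classification explicit is a small lookup table in code. The sketch below is illustrative only: the signal names come from the list above, but the role assigned to each signal is an assumption, and your own product may split them differently.

```python
from enum import Enum


class Signal(Enum):
    EXPLICIT_RATING = "explicit_rating"
    USER_CORRECTION = "user_correction"
    ESCALATION = "escalation"
    REOPEN = "re_open"
    DWELL_TIME = "dwell_time"
    ABANDONMENT = "abandonment"
    COMPLAINT = "complaint"
    BUSINESS_EVENT = "downstream_business_event"


class Role(Enum):
    HEALTH_INDICATOR = "health_indicator"            # watch the trend over time
    INVESTIGATION_TRIGGER = "investigation_trigger"  # open the trace and look


# Assumed assignment, for illustration. Decide this per product.
SIGNAL_ROLES = {
    Signal.EXPLICIT_RATING: Role.HEALTH_INDICATOR,
    Signal.DWELL_TIME: Role.HEALTH_INDICATOR,
    Signal.ABANDONMENT: Role.HEALTH_INDICATOR,
    Signal.BUSINESS_EVENT: Role.HEALTH_INDICATOR,
    Signal.USER_CORRECTION: Role.INVESTIGATION_TRIGGER,
    Signal.ESCALATION: Role.INVESTIGATION_TRIGGER,
    Signal.REOPEN: Role.INVESTIGATION_TRIGGER,
    Signal.COMPLAINT: Role.INVESTIGATION_TRIGGER,
}


def role_of(signal: Signal) -> Role:
    """Return the role a signal plays in review."""
    return SIGNAL_ROLES[signal]


def unclassified() -> list:
    """Signals that nobody has assigned a role to yet."""
    return [s for s in Signal if s not in SIGNAL_ROLES]
```

Keeping the mapping in one place forces the team to state what each signal is for, and unclassified() flags any signal that was added without that decision.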

Good

The good version samples production traces, links each piece of feedback to its task type and model version, and reviews failures by slice rather than in aggregate. It treats online signals as a complement to offline suites, not a replacement.
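Reviewing by slice can be as simple as grouping sampled records before computing a rate. This is a minimal sketch, assuming each record is a dict with hypothetical keys task, model_version, and a boolean failed set during review.

```python
from collections import defaultdict


def failure_rate_by_slice(records, keys=("task", "model_version")):
    """Return {slice_key: failure_rate}, where slice_key is a tuple of the chosen fields."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        slice_key = tuple(record[k] for k in keys)
        totals[slice_key] += 1
        if record["failed"]:
            failures[slice_key] += 1
    return {k: failures[k] / totals[k] for k in totals}


# Hypothetical sampled records, for illustration only.
records = [
    {"task": "summarize", "model_version": "v1", "failed": True},
    {"task": "summarize", "model_version": "v1", "failed": False},
    {"task": "extract", "model_version": "v1", "failed": False},
]
rates = failure_rate_by_slice(records)
```

An aggregate failure rate over these records would hide the fact that every failure sits in one task type.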

Bad

The bad version optimizes for engagement because the graph updates quickly. The model becomes chatty, users click more, and task completion quietly worsens. Congratulations, you built social media in miniature.
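A cheap guard against this failure mode is to compare engagement and task completion together instead of watching either alone. The sketch below assumes two hypothetical metrics, engagement and task_completion, each measured as a rate before and after a change.

```python
def engagement_trap(before, after, tolerance=0.0):
    """True if engagement rose while task completion fell by more than tolerance."""
    engagement_up = after["engagement"] > before["engagement"]
    completion_down = after["task_completion"] < before["task_completion"] - tolerance
    return engagement_up and completion_down


# Hypothetical numbers, for illustration only.
before = {"engagement": 0.40, "task_completion": 0.80}
after = {"engagement": 0.55, "task_completion": 0.72}
flagged = engagement_trap(before, after)
```

A flag here is an investigation trigger, not a verdict: it says the two metrics moved apart and someone should read traces before celebrating the engagement graph.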

Ugly

The ugly reality is attribution. A bad outcome might come from retrieval, UI wording, user intent mismatch, or model behavior, and a single metric cannot tell these apart. Store enough trace context to investigate instead of arguing from one metric.
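Whether a trace carries "enough context" can be checked mechanically before anyone starts an investigation. The field names below (retrieved_docs, ui_variant, user_query, model_output, model_version) are assumptions chosen to mirror the possible causes named above, not a standard.

```python
# Assumed fields, one or more per candidate cause of a bad outcome.
REQUIRED_CONTEXT = (
    "retrieved_docs",  # retrieval
    "ui_variant",      # UI wording
    "user_query",      # user intent
    "model_output",    # model behavior
    "model_version",   # model behavior
)


def missing_context(trace, required=REQUIRED_CONTEXT):
    """Return the required fields that are absent or None in a trace dict."""
    return [field for field in required if trace.get(field) is None]


# Hypothetical trace, for illustration only.
trace = {
    "trace_id": "t-001",
    "user_query": "reset my password",
    "model_output": "Here are the steps...",
    "model_version": "v1",
}
gaps = missing_context(trace)
```

A trace with gaps can still be reviewed, but the reviewer knows up front which causes cannot be ruled in or out.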

Artifact to produce

Build a feedback schema: task, model version, trace id, explicit rating, implicit event, user segment, and review status.
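As a starting point, the schema can be written as a dataclass. The field list follows the one above; the types, defaults, and the validation rule are assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FeedbackRecord:
    task: str
    model_version: str
    trace_id: str
    explicit_rating: Optional[int] = None  # e.g. +1 / -1; the scale is an assumption
    implicit_event: Optional[str] = None   # e.g. "abandonment", "re_open"
    user_segment: Optional[str] = None
    review_status: str = "unreviewed"      # assumed default

    def __post_init__(self):
        # Assumed rule: a record with no signal at all is not feedback.
        if self.explicit_rating is None and self.implicit_event is None:
            raise ValueError("record needs an explicit rating or an implicit event")


# Hypothetical record, for illustration only.
record = FeedbackRecord(
    task="summarize",
    model_version="v1",
    trace_id="t-001",
    implicit_event="re_open",
)
row = asdict(record)  # plain dict, ready to log or store
```

The trace id is what lets a reviewer go from a signal back to the full interaction, so it is required rather than optional in this sketch.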

Online signal review

Question	Why it matters
What does each signal actually mean?	Clicks and thumbs are ambiguous.
Which trace fields explain the signal?	Attribution needs context.
How are sampled failures reviewed?	Raw telemetry needs interpretation.
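For the last question, one possible answer is a fixed-size random sample of failures per slice, so that high-volume task types do not crowd out the rest. This is a sketch under assumptions: records are dicts with hypothetical task and failed keys, and the per-slice quota is arbitrary.

```python
import random
from collections import defaultdict


def sample_failures_for_review(records, per_slice=2, key="task", seed=0):
    """Pick up to per_slice failed records from each slice, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the review queue can be regenerated
    by_slice = defaultdict(list)
    for record in records:
        if record["failed"]:
            by_slice[record[key]].append(record)
    return {
        slice_name: rng.sample(items, min(per_slice, len(items)))
        for slice_name, items in by_slice.items()
    }


# Hypothetical records, for illustration only.
records = (
    [{"task": "summarize", "trace_id": f"s-{i}", "failed": True} for i in range(5)]
    + [{"task": "extract", "trace_id": "e-0", "failed": True}]
    + [{"task": "extract", "trace_id": "e-1", "failed": False}]
)
queue = sample_failures_for_review(records)
```

Each sampled record still needs a human or a rubric to interpret it; the sampler only decides what gets looked at.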

Chapter takeaway

Production feedback is real, but not automatically wise. Users generate evidence, not neatly labeled training scripture.

References

Arize Phoenix evaluations