MCP Mastery
Chapter 15
boss
~40 min

Eval Observability And Traces

Debug failed evals with traces, spans, and reproducible artifacts.

Eval Toolkit 2026.05
Observability trace-first
Python 3.11
LangSmith linked
Reviewed 2026-05-17

Reading this chapter helps prevent six common eval-writing mistakes.

The setup

A failed eval without trace context is a fortune cookie. You know something went wrong, but not where. Traces turn failed examples into debuggable artifacts: inputs, prompts, tool calls, retrieval results, model outputs, judge output, and timing.
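To make the list above concrete, here is a minimal sketch of a span recorder in plain Python. The names `Span`, `Trace`, and `span` are hypothetical and are not taken from any tracing library; a real setup would more likely send spans to an existing tracing backend. The sketch only shows the shape of the data: each step records its inputs, its outputs, and how long it took.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One step in a run: retrieval, a tool call, the model call, the judge."""
    name: str
    inputs: dict[str, Any]
    outputs: dict[str, Any] = field(default_factory=dict)
    duration_s: float = 0.0


@dataclass
class Trace:
    """All spans recorded for a single eval run (hypothetical structure)."""
    run_id: str
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str, **inputs: Any) -> Iterator[Span]:
        current = Span(name=name, inputs=inputs)
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record timing and keep the span even if the step raised.
            current.duration_s = time.perf_counter() - start
            self.spans.append(current)


# Usage: wrap each step so a failed example keeps its full path.
trace = Trace(run_id="run-001")
with trace.span("retrieval", query="refund policy") as step:
    step.outputs["docs"] = ["doc-17"]
with trace.span("judge", answer="Refunds take 30 days.") as step:
    step.outputs["score"] = 0.0
```

Recording the span in a `finally` block is a deliberate choice: the steps that crash are usually the ones you most want to see in the trace.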

Picture this

[Figure: Good, bad, and ugly paths for trace-backed evaluation.]

Mental model

Attach every score to a run id. The run should preserve model version, prompt version, data version, retrieved docs, tool responses, judge prompt, and score explanation. If you cannot reproduce it, you cannot confidently fix it.
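One way to picture that record is a small frozen dataclass, sketched below. The class name `EvalRun` and its field names are assumptions chosen to mirror the list above, not a schema from any particular toolkit. The point is that the score never travels without the run id and the versions that produced it.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class EvalRun:
    """One scored example plus the context needed to reproduce it.

    Field names are illustrative assumptions, not a standard schema.
    """
    model_version: str
    prompt_version: str
    data_version: str
    judge_prompt: str
    retrieved_docs: tuple[str, ...]
    tool_responses: tuple[str, ...]
    score: float
    score_explanation: str
    # Generated once per run so every score can be traced back to it.
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize the whole record for storage next to the eval report."""
        return json.dumps(asdict(self), sort_keys=True)


run = EvalRun(
    model_version="model-2026-05",
    prompt_version="prompt-v7",
    data_version="dataset-v3",
    judge_prompt="Grade the answer from 0 to 1 against the reference.",
    retrieved_docs=("doc-17",),
    tool_responses=(),
    score=0.0,
    score_explanation="Answer cited the wrong refund window.",
)
stored = run.to_json()
```

Making the record frozen is a small guard against editing a run after the fact, which would quietly defeat the goal of reproducibility.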

Good

The good version lets a developer open a failed example and see the exact chain of events. It tags failures by component so product fixes do not all become prompt tweaks.
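Tagging by component can be as simple as a fixed vocabulary and a counter. In the sketch below, the `Component` values and the `failures_by_component` helper are hypothetical; your own pipeline may have different stages. It shows how a batch of tagged failures turns into a per-component tally.

```python
from collections import Counter
from enum import Enum


class Component(str, Enum):
    """Where a failure originated (an assumed, illustrative vocabulary)."""
    RETRIEVAL = "retrieval"
    TOOL = "tool"
    PROMPT = "prompt"
    MODEL = "model"
    JUDGE = "judge"


def failures_by_component(failures: list[tuple[str, Component]]) -> dict[str, int]:
    """Count failed runs per component from (run_id, component) pairs."""
    counts = Counter(component.value for _run_id, component in failures)
    return dict(counts)


tagged = [
    ("run-001", Component.RETRIEVAL),
    ("run-002", Component.PROMPT),
    ("run-003", Component.RETRIEVAL),
]
summary = failures_by_component(tagged)
```

If most failures land on retrieval, no amount of prompt tweaking will fix them, and the tally makes that visible before the debate starts.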

Bad

The bad version stores only the final answer and the score. When the score drops, everyone debates from memory. This is called observability if you squint and have no standards.

Ugly

The ugly reality is storage and privacy. Traces may include sensitive data. Use redaction, a retention policy, access controls, and sampled retention rather than pretending logs are free.
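Two of those controls, redaction and sampled retention, are easy to sketch. The example below is a simplified illustration under stated assumptions: the email regex is deliberately basic and is not a complete PII detector, and the rule "keep every failure, sample the passes" is one plausible policy rather than a requirement from this chapter. The function names are hypothetical.

```python
import hashlib
import re

# Deliberately simple pattern; real redaction needs a broader PII strategy.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace email addresses before the text is written to a trace."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def should_retain(run_id: str, passed: bool, sample_rate: float = 0.1) -> bool:
    """Keep all failed runs; keep a deterministic sample of passing runs.

    Hashing the run id (instead of calling random) means the same run
    always gets the same decision, so retention is reproducible.
    """
    if not passed:
        return True
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 4 bytes to a fraction in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


clean = redact("Customer bob@example.com asked about refunds.")
keep_failure = should_retain("run-001", passed=False)
```

Redact before storage, not at read time: a trace that was never written with the raw value cannot leak it later.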

Artifact to produce

Define trace requirements: identifiers, versions, spans, redaction policy, retention, failure tags, and links from eval report to trace.
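Those requirements can live as a small, checkable spec rather than a paragraph in a wiki. The sketch below assumes a trace is stored as a flat dictionary; the group names and keys (such as `trace_url` and `retention_days`) are hypothetical placeholders for whatever your own artifact defines.

```python
# Hypothetical requirement groups mirroring the artifact described above.
REQUIRED_TRACE_FIELDS: dict[str, tuple[str, ...]] = {
    "identifiers": ("run_id", "example_id"),
    "versions": ("model_version", "prompt_version", "data_version", "judge_version"),
    "spans": ("spans",),
    "governance": ("redaction_policy", "retention_days"),
    "triage": ("failure_tags", "trace_url"),
}


def unmet_requirements(trace: dict) -> dict[str, list[str]]:
    """Return missing keys per requirement group; empty dict means compliant."""
    gaps: dict[str, list[str]] = {}
    for group, keys in REQUIRED_TRACE_FIELDS.items():
        missing = [key for key in keys if key not in trace]
        if missing:
            gaps[group] = missing
    return gaps


example_trace = {
    "run_id": "run-001",
    "example_id": "ex-42",
    "model_version": "model-2026-05",
    "prompt_version": "prompt-v7",
    "data_version": "dataset-v3",
    "spans": [],
    "redaction_policy": "emails-masked",
    "retention_days": 30,
    "failure_tags": ["retrieval"],
}
gaps = unmet_requirements(example_trace)
```

Here the check would flag the missing judge version and the missing link back to the trace, which are exactly the gaps that make a regression hard to explain later.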

Trace review

Question | Why it matters
Can a failed score link to a run trace? | Debugging needs artifacts.
Are prompt, model, data, and judge versions captured? | Versions explain regressions.
What must be redacted or retained? | Observability still has privacy obligations.

Chapter takeaway

A failed eval without a trace is just a complaint with formatting. Capture the path, not just the bruise.

Quiz

  1. What makes a failed eval actionable?

  2. Which is the bad version of trace-backed evaluation?

  3. What should the ugly reality change about your process?