The setup
A failed eval without trace context is a fortune cookie. You know something went wrong, but not where. Traces turn failed examples into debuggable artifacts: inputs, prompts, tool calls, retrieval results, model outputs, judge output, and timing.
Mental model
Attach every score to a run id. The run should preserve model version, prompt version, data version, retrieved docs, tool responses, judge prompt, and score explanation. If you cannot reproduce a failure, you cannot confidently fix it.
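A minimal sketch of such a run record, assuming a plain Python dataclass serialized to JSON. The class name `EvalTrace`, its field names, and the example values are hypothetical; they mirror the list above and are not a required schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class EvalTrace:
    """One scored example plus the context needed to reproduce it (hypothetical schema)."""
    model_version: str
    prompt_version: str
    data_version: str
    inputs: dict
    retrieved_docs: list
    tool_responses: list
    model_output: str
    judge_prompt: str
    score: float
    score_explanation: str
    # Every score hangs off this id; generated once per run.
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        # Sorted keys keep serialized traces stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
trace = EvalTrace(
    model_version="model-2024-05",
    prompt_version="prompt-v7",
    data_version="evalset-v3",
    inputs={"question": "What is the refund window?"},
    retrieved_docs=["policy.md#refunds"],
    tool_responses=[],
    model_output="30 days",
    judge_prompt="Grade the answer against the policy document.",
    score=0.0,
    score_explanation="Policy states 14 days.",
)
print(trace.to_json())
```

Storing the judge prompt and score explanation next to the model output means a reviewer can tell a wrong answer from a wrong grade without rerunning anything.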
Good
The good version lets a developer open a failed example and see the exact chain of events. It tags each failure by the component that caused it (retrieval, tool call, prompt, model, or judge), so that not every fix turns into a prompt tweak.
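Component tagging can be sketched as a simple tally over failed examples. The tag vocabulary in `COMPONENTS`, the `component` key, and the sample run ids are assumptions for illustration, not a fixed taxonomy.

```python
from collections import Counter

# Hypothetical tag vocabulary; adapt it to the pipeline's actual stages.
COMPONENTS = ("retrieval", "tool", "prompt", "model", "judge")


def tally_failures(failures):
    """Count failed examples per component tag; unknown or missing tags become 'untagged'."""
    counts = Counter()
    for failure in failures:
        tag = failure.get("component")
        counts[tag if tag in COMPONENTS else "untagged"] += 1
    return counts


# Illustrative data only.
failures = [
    {"run_id": "a1", "component": "retrieval"},
    {"run_id": "b2", "component": "retrieval"},
    {"run_id": "c3", "component": "prompt"},
    {"run_id": "d4"},
]
print(tally_failures(failures).most_common())
```

A tally like this shows at a glance whether a score drop is mostly a retrieval problem or a prompt problem, and the `untagged` bucket shows how much triage is still outstanding.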
Bad
The bad version stores only the final answer and the score. When the score drops, everyone debates from memory. This is called observability if you squint and have no standards.
Ugly
The ugly reality is storage and privacy. Traces may include sensitive data. Use redaction, a retention policy, access controls, and sampled retention rather than pretending logs are free.
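Redaction and sampled retention can be sketched in a few lines. Everything here is an assumption for illustration: the email-only pattern stands in for a real PII detector, and the policy of keeping every failed trace while sampling passing ones at 10% is one possible choice, not a recommendation from the text.

```python
import hashlib
import re

# Deliberately simple pattern; real redaction needs broader coverage than email addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email-like substrings before the trace is stored."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def keep_full_trace(run_id: str, failed: bool, sample_rate: float = 0.1) -> bool:
    """Keep every failed trace; keep a deterministic sample of passing ones."""
    if failed:
        return True
    # Hashing the run id makes the decision repeatable across processes.
    bucket = int(hashlib.sha256(run_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sample_rate * 10_000


print(redact("Contact jane.doe@example.com about the refund"))
print(keep_full_trace("run-123", failed=True))
```

Sampling by hash rather than by random draw means the same run id always gets the same retention decision, which keeps links from eval reports predictable.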
Artifact to produce
Define trace requirements: identifiers, versions, spans, redaction policy, retention, failure tags, and links from eval report to trace.
Trace review
| Question | Why it matters |
|---|---|
| Can a failed score link to a run trace? | Debugging needs artifacts. |
| Are prompt, model, data, and judge versions captured? | Versions explain regressions. |
| What must be redacted or retained? | Observability still has privacy obligations. |
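The three review questions can be turned into an automated check on each eval report row. This is a sketch under assumptions: the key names (`run_id`, `judge_version`, `redacted`) and the in-memory `traces` lookup are hypothetical stand-ins for whatever trace store is in use.

```python
REQUIRED_VERSIONS = ("prompt_version", "model_version", "data_version", "judge_version")


def review_trace(report_row: dict, traces: dict) -> dict:
    """Answer the three review questions for one eval report row."""
    trace = traces.get(report_row.get("run_id"))
    found = trace is not None
    return {
        # Can the failed score link to a run trace?
        "links_to_trace": found,
        # Are prompt, model, data, and judge versions captured?
        "versions_captured": found and all(trace.get(name) for name in REQUIRED_VERSIONS),
        # Has the trace passed through redaction before storage?
        "redaction_applied": found and trace.get("redacted") is True,
    }


# Illustrative data only.
traces = {
    "a1": {
        "prompt_version": "prompt-v7",
        "model_version": "model-2024-05",
        "data_version": "evalset-v3",
        "judge_version": "judge-v2",
        "redacted": True,
    }
}
print(review_trace({"run_id": "a1", "score": 0.0}, traces))
print(review_trace({"run_id": "zz", "score": 0.0}, traces))
```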
Chapter takeaway
A failed eval without a trace is just a complaint with formatting. Capture the path, not just the bruise.