MCP Mastery

Writing Effective Evals Reference

Glossary, diagram atlas, and the eval patterns you will actually touch when the demo graduates into a system.

Version posture

Stack | Pinned in chapters | Why it matters
Eval toolkit | 2026.05 | Rubrics, datasets, judges, and launch thresholds.
Observability | trace-first | Trace artifacts make failed evals debug-friendly instead of mystical.
Python | 3.11+ | Modern async ergonomics without pretending 3.8 is fine forever.

Diagram atlas

Glossary

Golden set

A curated set of examples with trusted expected behavior. Good use: stable regression checks. Bad use: final exam everyone trains on. Ugly reality: it drifts unless refreshed.
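A golden-set regression check can be sketched as follows. This is a minimal illustration, not a prescribed harness: `GoldenExample`, `run_golden_set`, `toy_system`, and the three sample records are hypothetical names and data, and the normalized exact-match comparison is an assumption that only suits short, canonical answers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    prompt: str
    expected: str  # trusted expected behavior, reviewed by a human


def run_golden_set(
    system: Callable[[str], str],
    golden: list[GoldenExample],
) -> list[str]:
    """Return the ids of golden examples whose output no longer matches."""
    failures = []
    for ex in golden:
        # Assumption: exact match after trimming and lowercasing is good enough.
        # Free-form answers would need a looser comparison.
        if system(ex.prompt).strip().lower() != ex.expected.strip().lower():
            failures.append(ex.example_id)
    return failures


# Hypothetical stand-in for the system under test.
def toy_system(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


GOLDEN = [
    GoldenExample("g1", "2+2", "4"),
    GoldenExample("g2", "capital of France", "paris"),
    GoldenExample("g3", "capital of Peru", "Lima"),
]

failed_ids = run_golden_set(toy_system, GOLDEN)
print(failed_ids)
```

Returning ids rather than a single pass rate keeps each regression traceable to a specific example, which also makes it easier to notice when an entry has gone stale and needs refreshing.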

LLM-as-judge

A model used to score another system's outputs, trustworthy only when paired with calibration, oracle checks, and disagreement review. Good use: scalable triage. Bad use: unquestioned truth machine. Ugly reality: judges inherit bias and drift.
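One way to calibrate a judge is to compare its labels with trusted oracle labels on the same items and queue the disagreements for review. The sketch below uses Cohen's kappa as the agreement measure. That choice is an assumption rather than something this reference prescribes, and the function names and sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(judge: list[str], oracle: list[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if len(judge) != len(oracle) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == o for j, o in zip(judge, oracle)) / n
    judge_counts, oracle_counts = Counter(judge), Counter(oracle)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        (judge_counts[label] / n) * (oracle_counts[label] / n)
        for label in set(judge) | set(oracle)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(ids: list[str], judge: list[str], oracle: list[str]) -> list[str]:
    """Ids where the judge and the oracle differ, for human review."""
    return [i for i, j, o in zip(ids, judge, oracle) if j != o]


# Hypothetical labels for six items.
ids = ["e1", "e2", "e3", "e4", "e5", "e6"]
judge_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
oracle_labels = ["pass", "fail", "fail", "fail", "pass", "pass"]

kappa = cohen_kappa(judge_labels, oracle_labels)
to_review = disagreements(ids, judge_labels, oracle_labels)
print(round(kappa, 3), to_review)
```

Raw agreement here is 4 of 6, but half of that would be expected by chance given the label balance, which is why the chance-corrected figure is much lower.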

Slice evaluation

Reporting quality by user intent, domain, risk, or cohort. Good use: reveal hidden regressions. Bad use: slices invented after bad news. Ugly reality: small slices give noisy pass rates, so a number without its sample size can mislead.
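The noise in small slices can be made visible by reporting an interval next to each pass rate. The sketch below uses a 95% Wilson score interval, which is one reasonable choice and not something this reference mandates. The slice names and results are invented for illustration.

```python
import math
from collections import defaultdict


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def report_by_slice(results: list[tuple[str, bool]]) -> dict[str, dict]:
    """Aggregate (slice_name, passed) pairs into per-slice statistics."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [passes, total]
    for slice_name, passed in results:
        counts[slice_name][0] += int(passed)
        counts[slice_name][1] += 1
    report = {}
    for name, (passes, total) in counts.items():
        report[name] = {
            "n": total,
            "pass_rate": passes / total,
            "ci95": wilson_interval(passes, total),
        }
    return report


# Hypothetical results: a large slice and a tiny one.
results = (
    [("billing", True)] * 90
    + [("billing", False)] * 10
    + [("refunds", True), ("refunds", False)]
)
report = report_by_slice(results)
for name, stats in report.items():
    low, high = stats["ci95"]
    print(f"{name}: n={stats['n']} pass_rate={stats['pass_rate']:.2f} ci95=({low:.2f}, {high:.2f})")
```

Defining the slice names before the run, rather than after a bad result, is what keeps this report honest.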

Canary rollout

A staged production exposure tied to pre-registered guardrails. Good use: limit blast radius. Bad use: launch theater without thresholds. Ugly reality: early canary traffic is sparse, so a real guardrail breach is hard to tell from noise.
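Pre-registered guardrails can be written down as data before the rollout starts and then checked mechanically. The following is a sketch under stated assumptions: the `Guardrail` fields, the metric names, the thresholds, and the three-way advance/hold/rollback outcome are all hypothetical and do not come from any particular rollout tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_value: float  # breach if the observed value exceeds this
    min_samples: int  # below this, the signal is too sparse to judge


def canary_decision(
    guardrails: list[Guardrail],
    observed: dict[str, tuple[float, int]],
) -> str:
    """observed maps metric name -> (value, sample_count)."""
    sparse = False
    for g in guardrails:
        value, samples = observed[g.metric]
        if samples < g.min_samples:
            sparse = True  # not enough data to call this one either way
            continue
        if value > g.max_value:
            return "rollback"
    return "hold" if sparse else "advance"


# Hypothetical guardrails, fixed before the canary starts.
GUARDRAILS = [
    Guardrail("error_rate", max_value=0.02, min_samples=500),
    Guardrail("judge_fail_rate", max_value=0.10, min_samples=200),
]

decision = canary_decision(
    GUARDRAILS,
    {"error_rate": (0.01, 800), "judge_fail_rate": (0.25, 40)},
)
print(decision)
```

In this example the judge fail rate looks bad but rests on only 40 samples, so the sketch holds at the current stage instead of advancing or rolling back.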

Judge drift

A change in evaluator behavior over time. Good use: monitored and recalibrated. Bad use: ignored because the graph still renders. Ugly reality: vendor updates happen.
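A simple way to monitor drift is to re-score a frozen anchor set with the current judge and compare against the scores recorded at calibration time. This is a minimal sketch: the `judge_drift` function, the 0.1 tolerance, and the anchor scores are illustrative assumptions, not recommended values.

```python
def judge_drift(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.1,
) -> dict:
    """Compare judge scores on the same anchor items at two points in time."""
    shared = sorted(baseline.keys() & current.keys())
    if not shared:
        raise ValueError("no anchor items in common")
    shifts = {k: current[k] - baseline[k] for k in shared}
    mean_shift = sum(shifts.values()) / len(shared)
    # Items whose individual score moved by more than the tolerance.
    moved = [k for k in shared if abs(shifts[k]) > tolerance]
    return {
        "mean_shift": mean_shift,
        "moved": moved,
        "drifted": abs(mean_shift) > tolerance,
    }


# Hypothetical scores for the same three anchor items.
baseline_scores = {"a1": 0.8, "a2": 0.4, "a3": 0.9}
current_scores = {"a1": 0.6, "a2": 0.2, "a3": 0.9}

drift = judge_drift(baseline_scores, current_scores)
print(drift["drifted"], drift["moved"], round(drift["mean_shift"], 3))
```

Because the anchor items and the system outputs being judged are held fixed, any shift in scores is attributable to the judge itself, for example after a vendor model update.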

Eval-in-CI

Automated evaluation in delivery pipelines. Good use: deterministic blockers for high-risk regressions. Bad use: flaky gates everyone bypasses. Ugly reality: some checks should be informational.
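The split between blocking and informational checks can be encoded directly in the gate. The sketch below is an assumption about how such a gate might look. `CheckResult`, `ci_exit_code`, and the check names are hypothetical and not tied to any CI product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # False means informational: reported, never fails the build


def ci_exit_code(results: list[CheckResult]) -> int:
    """Print one line per check and return a process exit code."""
    for r in results:
        if r.passed:
            status = "PASS"
        else:
            status = "FAIL" if r.blocking else "WARN"
        print(f"[{status}] {r.name}")
    # Only failed blocking checks turn the build red.
    return 1 if any(r.blocking and not r.passed for r in results) else 0


# Hypothetical results from one pipeline run.
results = [
    CheckResult("golden_set_exact_match", passed=True, blocking=True),
    CheckResult("pii_leak_regex", passed=True, blocking=True),
    CheckResult("judge_helpfulness_score", passed=False, blocking=False),
]

exit_code = ci_exit_code(results)
print("exit code:", exit_code)
```

In a real pipeline script the return value would typically be passed to `sys.exit`. Keeping nondeterministic checks such as the judge score informational is one way to avoid the flaky gates that people learn to bypass.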