Writing Effective Evals Chapters

Eval toolkit 2026.05 · Observability trace-first · Python ≥3.11

#1
warmup

What Is An Eval?

~30 min

Stop grading one lucky demo and calling it evidence.
#2
warmup

Product Risk And Eval Scope

~30 min

Choose the eval that matches the failure you actually fear.
#3
warmup

Task Metrics And User Outcomes

~35 min

Tie scores to user value before the dashboard becomes theater.
#4
mid

Dataset Design And Sampling

~35 min

Build slices, stress sets, and golden sets without accidental leakage.
#5
mid

Rubrics And Human Judgment

~35 min

Make human labels repeatable enough to argue with productively.
#6
mid

LLM-As-Judge Protocols

~35 min

Use model judges with calibration, oracle checks, and suspicion.
#7
mid

Pairwise And Ranking Evals

~40 min

Compare outputs when absolute scores are too fragile to trust.
#8
mid

Regression Suites And CI

~35 min

Turn evals into guardrails without making every deploy hostage to flakes.
#9
mid

RAG Grounding And Citations

~40 min

Evaluate grounded answers, citation faithfulness, and retrieval misses separately.
#10
boss

Safety Risk And Policy Evals

~35 min

Probe policy failures with proportional severity and honest limits.
#11
boss

Online Signals And Feedback

~40 min

Read production signals without mistaking clicks for quality.
#12
boss

Experiments, Canaries, And Rollouts

~40 min

Connect eval thresholds to staged rollout decisions.
#13
boss

Drift Monitoring And Refresh

~30 min

Detect when your once-useful eval stopped describing reality.
#14
mid

Cost, Latency, And Sampling

~35 min

Spend judge calls where they buy confidence, not vibes.
#15
boss

Eval Observability And Traces

~40 min

Debug failed evals with traces, spans, and reproducible artifacts.
#16
mid

Reporting And Decision Records

~30 min

Translate scores into launch decisions people can audit later.
#17
boss

Anti-Patterns And Failure Modes

~35 min

Name the traps before they show up wearing a KPI badge.
#18
boss

Capstone Eval Operating System

~45 min

Assemble offline, online, human, and model judging into a usable loop.