Writing Effective Evals
Chapters
Eval toolkit 2026.05 · Observability trace-first · Python ≥3.11
#1
warmup
What Is An Eval?
~30 min
Stop grading one lucky demo and calling it evidence.
#2
warmup
Product Risk And Eval Scope
~30 min
Choose the eval that matches the failure you actually fear.
#3
warmup
Task Metrics And User Outcomes
~35 min
Tie scores to user value before the dashboard becomes theater.
#4
mid
Dataset Design And Sampling
~35 min
Build slices, stress sets, and golden sets without accidental leakage.
#5
mid
Rubrics And Human Judgment
~35 min
Make human labels repeatable enough to argue with productively.
#6
mid
LLM-As-Judge Protocols
~35 min
Use model judges with calibration, oracle checks, and suspicion.
#7
mid
Pairwise And Ranking Evals
~40 min
Compare outputs when absolute scores are too fragile to trust.
#8
mid
Regression Suites And CI
~35 min
Turn evals into guardrails without making every deploy hostage to flakes.
#9
mid
RAG Grounding And Citations
~40 min
Evaluate grounded answers, citation faithfulness, and retrieval misses separately.
#10
boss
Safety, Risk, And Policy Evals
~35 min
Probe policy failures with proportional severity and honest limits.
#11
boss
Online Signals And Feedback
~40 min
Read production signals without mistaking clicks for quality.
#12
boss
Experiments, Canaries, And Rollouts
~40 min
Connect eval thresholds to staged rollout decisions.
#13
boss
Drift Monitoring And Refresh
~30 min
Detect when your once-useful eval has stopped describing reality.
#14
mid
Cost, Latency, And Sampling
~35 min
Spend judge calls where they buy confidence, not vibes.
#15
boss
Eval Observability And Traces
~40 min
Debug failed evals with traces, spans, and reproducible artifacts.
#16
mid
Reporting And Decision Records
~30 min
Translate scores into launch decisions people can audit later.
#17
boss
Anti-Patterns And Failure Modes
~35 min
Name the traps before they show up wearing a KPI badge.
#18
boss
Capstone: Eval Operating System
~45 min
Assemble offline, online, human, and model judging into a usable loop.