Writing Effective Evals Reference
Glossary, diagram atlas, and the eval patterns you will actually touch when the demo graduates into a system.
Version posture
| Stack | Pinned in chapters | Why it matters |
|---|---|---|
| Eval toolkit | 2026.05 | Rubrics, datasets, judges, and launch thresholds. |
| Observability | trace-first | Trace artifacts make failed evals debug-friendly instead of mystical. |
| Python | 3.11+ | Modern async ergonomics such as `asyncio.TaskGroup`, without pretending 3.8 is fine forever. |
Diagram atlas
- ew-what-is-an-eval — What Is An Eval?
- ew-product-risk-and-eval-scope — Product Risk And Eval Scope
- ew-task-metrics-and-user-outcomes — Task Metrics And User Outcomes
- ew-dataset-design-and-sampling — Dataset Design And Sampling
- ew-rubrics-and-human-judgment — Rubrics And Human Judgment
- ew-llm-as-judge-protocols — LLM-As-Judge Protocols
- ew-pairwise-and-ranking-evals — Pairwise And Ranking Evals
- ew-regression-suites-and-ci — Regression Suites And CI
- ew-rag-grounding-and-citations — RAG Grounding And Citations
- ew-safety-risk-and-policy-evals — Safety Risk And Policy Evals
- ew-online-signals-and-feedback — Online Signals And Feedback
- ew-experiments-canaries-and-rollouts — Experiments Canaries And Rollouts
- ew-drift-monitoring-and-refresh — Drift Monitoring And Refresh
- ew-cost-latency-and-sampling — Cost Latency And Sampling
- ew-eval-observability-and-traces — Eval Observability And Traces
- ew-reporting-and-decision-records — Reporting And Decision Records
- ew-anti-patterns-and-failure-modes — Anti-Patterns And Failure Modes
- ew-capstone-eval-operating-system — Capstone Eval Operating System
- ew-playbook-lifecycle — Writing Effective Evals Playbook
Glossary
Golden set
A curated set of examples with trusted expected behavior. Good use: stable regression checks. Bad use: final exam everyone trains on. Ugly reality: it drifts unless refreshed.
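As a minimal sketch of the "stable regression check" use, the snippet below replays a tiny golden set against a system and lists the examples whose output no longer matches the trusted answer. Everything named here (`GoldenExample`, `run_golden_set`, `toy_system`, the two examples) is hypothetical and for illustration only; a real suite would likely replace the exact-match comparison with a rubric or task-specific check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    prompt: str
    expected: str  # trusted expected behavior, curated by hand


def run_golden_set(
    system: Callable[[str], str], golden: list[GoldenExample]
) -> list[str]:
    """Return the IDs of examples whose output no longer matches the trusted answer."""
    failures = []
    for ex in golden:
        # Assumption: case-insensitive exact match is good enough for this toy task.
        if system(ex.prompt).strip().lower() != ex.expected.strip().lower():
            failures.append(ex.example_id)
    return failures


GOLDEN = [
    GoldenExample("g1", "capital of France?", "Paris"),
    GoldenExample("g2", "2 + 2?", "4"),
]


def toy_system(prompt: str) -> str:
    # Stand-in for the real model call (hypothetical); it gets g2 wrong on purpose.
    answers = {"capital of France?": "paris", "2 + 2?": "5"}
    return answers.get(prompt, "")


failed_ids = run_golden_set(toy_system, GOLDEN)
print(failed_ids)
```

Keeping an ID on every example is what makes a refresh tractable later: stale examples can be retired by ID without disturbing the rest of the set.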
LLM-as-judge
A model-based evaluator used with calibration, oracle checks, and disagreement review. Good use: scalable triage. Bad use: unquestioned truth machine. Ugly reality: judges inherit bias and drift.
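One way to make "calibration" and "disagreement review" concrete is to compare judge labels against human labels on the same items. The sketch below computes Cohen's kappa (agreement corrected for chance) and collects the indices where the two disagree. The labels are made up, and the function names are illustrative rather than part of any particular eval toolkit.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(human: list[str], judge: list[str]) -> list[int]:
    """Indices to send to a human for disagreement review."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]


# Made-up labels for six items.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail"]

kappa = cohen_kappa(human_labels, judge_labels)
to_review = disagreements(human_labels, judge_labels)
print(f"kappa={kappa:.2f} review={to_review}")
```

Raw agreement alone can flatter a judge when one label dominates, which is the reason to prefer a chance-corrected number here.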
Slice evaluation
Reporting quality separately by user intent, domain, risk, or cohort instead of as a single aggregate number. Good use: reveal regressions the aggregate hides. Bad use: slices invented after bad news. Ugly reality: small slices give noisy estimates.
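The noise problem is easy to see once each slice carries an interval instead of a bare rate. The sketch below groups results by slice and attaches a Wilson score interval to each pass rate; the slice names and counts are invented, and `slice_report` is a hypothetical helper, not an API from the pinned toolkit.

```python
import math
from collections import defaultdict


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def slice_report(records: list[tuple[str, bool]]) -> dict[str, dict[str, float]]:
    """Aggregate (slice_name, passed) records into per-slice rates and intervals."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passes, total]
    for slice_name, passed in records:
        counts[slice_name][0] += int(passed)
        counts[slice_name][1] += 1
    report = {}
    for name, (passes, total) in counts.items():
        low, high = wilson_interval(passes, total)
        report[name] = {"n": total, "pass_rate": passes / total, "low": low, "high": high}
    return report


# Invented data: a large slice and a very small one.
records = [("billing", True)] * 90 + [("billing", False)] * 10
records += [("legal", True)] * 4 + [("legal", False)] * 1

report = slice_report(records)
for name, row in report.items():
    print(f"{name}: n={row['n']} pass_rate={row['pass_rate']:.2f} "
          f"interval=({row['low']:.2f}, {row['high']:.2f})")
```

The five-example slice ends up with a far wider interval than the hundred-example one, so a single flipped example there should not be read as a regression.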
Canary rollout
A staged production exposure tied to pre-registered guardrails. Good use: limit blast radius. Bad use: launch theater without thresholds. Ugly reality: sparse signals make judgment hard.
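"Pre-registered" can be as simple as writing the thresholds down as data before any traffic is exposed. The sketch below assumes each guardrail has a ceiling and a minimum sample count, and returns one of three decisions; the metric names, thresholds, and the `hold` state for sparse signals are illustrative assumptions, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_value: float  # canary metric must stay at or below this ceiling
    min_samples: int  # below this count, the signal is too sparse to judge


def evaluate_canary(
    guardrails: list[Guardrail], observed: dict[str, tuple[float, int]]
) -> str:
    """observed maps metric -> (value, sample_count).

    Returns 'rollback' if any sufficiently sampled guardrail is breached,
    'hold' if any guardrail is missing or under-sampled, else 'proceed'.
    """
    decision = "proceed"
    for g in guardrails:
        if g.metric not in observed:
            decision = "hold"
            continue
        value, samples = observed[g.metric]
        if samples < g.min_samples:
            decision = "hold"  # not enough data to trust the value either way
            continue
        if value > g.max_value:
            return "rollback"
    return decision


# Hypothetical thresholds, fixed before the canary starts.
GUARDRAILS = [
    Guardrail("error_rate", max_value=0.02, min_samples=500),
    Guardrail("refusal_rate", max_value=0.05, min_samples=500),
]

early = evaluate_canary(GUARDRAILS, {"error_rate": (0.01, 1200), "refusal_rate": (0.09, 40)})
later = evaluate_canary(GUARDRAILS, {"error_rate": (0.01, 4000), "refusal_rate": (0.09, 900)})
print(early, later)
```

Treating an under-sampled metric as `hold` rather than `proceed` is one design choice for the sparse-signal problem; a team might instead extend the canary window or widen exposure.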
Judge drift
A change in evaluator behavior over time, so that the same output earns a different score than it did before. Good use: monitored and recalibrated. Bad use: ignored because the graph still renders. Ugly reality: vendor updates happen.
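A common way to monitor this is to re-score a frozen set of items with the current judge and compare against the scores recorded at calibration time. The sketch below assumes a 1 to 5 scoring scale, a hypothetical tolerance of 0.25, and invented anchor IDs and scores; none of these come from a specific toolkit.

```python
from statistics import mean
from typing import NamedTuple


class DriftReport(NamedTuple):
    items: int
    mean_shift: float       # signed: positive means the judge got more lenient
    mean_abs_change: float  # unsigned: catches shifts that cancel out on average
    drifted: bool


def judge_drift(
    baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.25
) -> DriftReport:
    """Compare judge scores on the same frozen items at two points in time."""
    shared = sorted(baseline.keys() & current.keys())
    if not shared:
        raise ValueError("no shared anchor items to compare")
    deltas = [current[item] - baseline[item] for item in shared]
    abs_change = mean([abs(d) for d in deltas])
    return DriftReport(len(shared), mean(deltas), abs_change, abs_change > tolerance)


# Invented scores for four frozen anchor items.
baseline_scores = {"a1": 4.0, "a2": 2.0, "a3": 5.0, "a4": 3.0}
current_scores = {"a1": 4.0, "a2": 3.0, "a3": 5.0, "a4": 4.0}

drift = judge_drift(baseline_scores, current_scores)
print(drift)
```

Reporting both the signed and unsigned change matters because a judge can become stricter on some items and more lenient on others while the average barely moves.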
Eval-in-CI
Automated evaluation in delivery pipelines. Good use: deterministic blockers for high-risk regressions. Bad use: flaky gates everyone bypasses. Ugly reality: some checks should be informational.
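The split between blocking and informational checks can be expressed directly in how the pipeline computes its exit code. In this sketch only a failed blocking check fails the build, while failed informational checks are printed as warnings. The check names and the `CheckResult` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # False means informational: report it, never fail the build


def ci_exit_code(results: list[CheckResult]) -> int:
    """Print one status line per check and return a process exit code."""
    for r in results:
        if r.passed:
            status = "PASS"
        elif r.blocking:
            status = "FAIL"
        else:
            status = "WARN"
        print(f"[{status}] {r.name}")
    return 1 if any(r.blocking and not r.passed for r in results) else 0


# Hypothetical results from one pipeline run.
results = [
    CheckResult("pii_leak_regression", passed=True, blocking=True),
    CheckResult("tone_rubric_score", passed=False, blocking=False),
]

exit_code = ci_exit_code(results)
print(f"exit code: {exit_code}")
```

In an actual pipeline script the last step would be `sys.exit(exit_code)`, so the CI runner sees a non-zero status only when a blocking check fails.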