Writing Effective Evals Reference

Glossary, diagram atlas, and the eval patterns you will actually touch when the demo graduates into a system.

Version posture

Stack	Pinned in chapters	Why it matters
Eval toolkit	2026.05	Rubrics, datasets, judges, and launch thresholds.
Observability	trace-first	Trace artifacts make failed evals debug-friendly instead of mystical.
Python	3.11+	Modern async ergonomics without pretending 3.8 is fine forever.

Diagram atlas

ew-what-is-an-eval — What Is An Eval?
ew-product-risk-and-eval-scope — Product Risk And Eval Scope
ew-task-metrics-and-user-outcomes — Task Metrics And User Outcomes
ew-dataset-design-and-sampling — Dataset Design And Sampling
ew-rubrics-and-human-judgment — Rubrics And Human Judgment
ew-llm-as-judge-protocols — LLM-As-Judge Protocols
ew-pairwise-and-ranking-evals — Pairwise And Ranking Evals
ew-regression-suites-and-ci — Regression Suites And CI
ew-rag-grounding-and-citations — RAG Grounding And Citations
ew-safety-risk-and-policy-evals — Safety Risk And Policy Evals
ew-online-signals-and-feedback — Online Signals And Feedback
ew-experiments-canaries-and-rollouts — Experiments Canaries And Rollouts
ew-drift-monitoring-and-refresh — Drift Monitoring And Refresh
ew-cost-latency-and-sampling — Cost Latency And Sampling
ew-eval-observability-and-traces — Eval Observability And Traces
ew-reporting-and-decision-records — Reporting And Decision Records
ew-anti-patterns-and-failure-modes — Anti-Patterns And Failure Modes
ew-capstone-eval-operating-system — Capstone Eval Operating System
ew-playbook-lifecycle — Writing Effective Evals Playbook

Glossary

Golden set

A curated set of examples with trusted expected behavior. Good use: stable regression checks. Bad use: final exam everyone trains on. Ugly reality: it drifts unless refreshed.
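
A minimal sketch of both halves of that definition: running the golden set as a regression check, and finding entries that are due for a refresh. The names `GoldenExample`, `run_golden_set`, and `stale_examples`, the exact-match comparison, and the 180-day window are all illustrative assumptions, not part of any pinned toolkit.

```python
from collections.abc import Callable
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GoldenExample:
    """Hypothetical record for one golden-set entry."""
    prompt: str
    expected: str
    reviewed_on: date  # last time a human confirmed the expected behavior


def run_golden_set(
    examples: list[GoldenExample],
    system: Callable[[str], str],
) -> tuple[float, list[str]]:
    """Return (pass_rate, failing_prompts) using exact-match comparison."""
    if not examples:
        raise ValueError("golden set is empty")
    failures = [ex.prompt for ex in examples if system(ex.prompt) != ex.expected]
    return 1 - len(failures) / len(examples), failures


def stale_examples(
    examples: list[GoldenExample],
    today: date,
    max_age_days: int = 180,  # assumed refresh window
) -> list[GoldenExample]:
    """Entries not reviewed within the window are candidates for refresh."""
    return [ex for ex in examples if (today - ex.reviewed_on).days > max_age_days]
```

Exact match only suits tasks with a single correct string. For free-form output, the comparison would be replaced by a rubric or judge call, while the staleness check stays the same.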

LLM-as-judge

A model-based evaluator that scores outputs against a rubric, and that is trustworthy only when paired with calibration, oracle checks, and disagreement review. Good use: scalable triage. Bad use: unquestioned truth machine. Ugly reality: judges inherit bias and drift.
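
One way calibration and disagreement review might look in code, assuming a small set of human-labeled items and a judge exposed as a plain callable. `calibrate_judge` and the verdict strings are hypothetical. A real judge would call a model, which is stubbed out here.

```python
from collections.abc import Callable


def calibrate_judge(
    labeled_items: list[tuple[str, str]],
    judge: Callable[[str], str],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Compare judge verdicts with human labels.

    labeled_items holds (response_text, human_label) pairs.
    Returns the agreement rate and a review queue of
    (response_text, human_label, judge_verdict) for every disagreement.
    """
    if not labeled_items:
        raise ValueError("need at least one human-labeled item")
    review_queue = []
    for text, human_label in labeled_items:
        verdict = judge(text)
        if verdict != human_label:
            review_queue.append((text, human_label, verdict))
    agreement = 1 - len(review_queue) / len(labeled_items)
    return agreement, review_queue
```

The review queue is the point of the exercise: a low agreement rate says the judge is off, but only the individual disagreements show whether the judge, the rubric, or the human label needs fixing.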

Slice evaluation

Reporting quality by user intent, domain, risk, or cohort. Good use: reveal hidden regressions. Bad use: slices invented after bad news. Ugly reality: small slices can be noisy.

Canary rollout

A staged production exposure tied to pre-registered guardrails. Good use: limit blast radius. Bad use: launch theater without thresholds. Ugly reality: sparse signals make judgment hard.
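
A sketch of what "pre-registered guardrails" could mean in practice: thresholds and minimum sample counts are written down before exposure starts, and the decision is a pure function of them. `Guardrail`, `canary_decision`, the metric names, and the three-way outcome are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """Hypothetical guardrail, fixed before the canary starts."""
    metric: str
    max_value: float   # breached when the observed value exceeds this
    min_samples: int   # below this, the metric is too sparse to judge


def canary_decision(
    guardrails: list[Guardrail],
    observed: dict[str, float],
    sample_count: int,
) -> str:
    """Return 'rollback', 'hold', or 'advance' for the current stage."""
    judgeable = [g for g in guardrails if sample_count >= g.min_samples]
    # Any breach on a metric with enough samples ends the rollout stage.
    if any(observed[g.metric] > g.max_value for g in judgeable):
        return "rollback"
    # Sparse signals: do not widen exposure until every guardrail is judgeable.
    if len(judgeable) < len(guardrails):
        return "hold"
    return "advance"
```

The "hold" outcome is a design choice for the sparse-signal case: when traffic is too thin to judge a guardrail, the sketch keeps exposure where it is instead of guessing in either direction.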

Judge drift

A change in evaluator behavior over time, typically after the underlying judge model or prompt changes. Good use: monitored and recalibrated. Bad use: ignored because the graph still renders. Ugly reality: vendor updates happen.
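
One simple way to monitor drift, sketched under the assumption that the same frozen set of examples is re-judged periodically and the verdicts are compared with a stored baseline. `judge_flip_rate`, `needs_recalibration`, and the 5% threshold are illustrative, not prescribed values.

```python
def judge_flip_rate(baseline: dict[str, str], current: dict[str, str]) -> float:
    """Fraction of shared examples whose verdict changed since the baseline.

    Both arguments map example id -> judge verdict on the same frozen set.
    """
    shared = baseline.keys() & current.keys()
    if not shared:
        raise ValueError("no overlapping examples between baseline and current")
    flips = sum(baseline[key] != current[key] for key in shared)
    return flips / len(shared)


def needs_recalibration(
    baseline: dict[str, str],
    current: dict[str, str],
    max_flip_rate: float = 0.05,  # assumed tolerance
) -> bool:
    """True when the judge has moved more than the tolerated amount."""
    return judge_flip_rate(baseline, current) > max_flip_rate
```

Because the examples are frozen, any flips come from the judge, not from the system under test. A judge run at nonzero temperature will show some flips even without drift, so the tolerance needs to sit above that baseline noise.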

Eval-in-CI

Automated evaluation in delivery pipelines. Good use: deterministic blockers for high-risk regressions. Bad use: flaky gates everyone bypasses. Ugly reality: some checks should be informational.
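
A sketch of separating blocking checks from informational ones, so that only deterministic, high-risk regressions can fail the pipeline. The `Check` record and `ci_exit_code` are hypothetical. The only real convention relied on is that a nonzero process exit code fails a CI step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Hypothetical result of one eval check in a pipeline run."""
    name: str
    passed: bool
    blocking: bool  # False means informational: reported, never fails the build


def ci_exit_code(checks: list[Check]) -> int:
    """Print every failure, but return nonzero only for blocking failures."""
    for check in checks:
        if not check.passed:
            label = "BLOCK" if check.blocking else "WARN"
            print(f"[{label}] {check.name} failed")
    return 1 if any(c.blocking and not c.passed for c in checks) else 0
```

A pipeline script could end with `sys.exit(ci_exit_code(checks))`. Flaky or judge-based checks can start as informational and be promoted to blocking once they have proven stable.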