Evals

Writing Effective Evals Chapters

Eval toolkit 2026.05 · Observability trace-first · Python ≥3.11

What Is An Eval?

~30 min

Stop grading one lucky demo and calling it evidence.

Product Risk And Eval Scope

~30 min

Choose the eval that matches the failure you actually fear.

Task Metrics And User Outcomes

~35 min

Tie scores to user value before the dashboard becomes theater.

Dataset Design And Sampling

~35 min

Build slices, stress sets, and golden sets without accidental leakage.

Rubrics And Human Judgment

~35 min

Make human labels repeatable enough to argue with productively.

LLM-As-Judge Protocols

~35 min

Use model judges with calibration, oracle checks, and suspicion.

Pairwise And Ranking Evals

~40 min

Compare outputs when absolute scores are too fragile to trust.

Regression Suites And CI

~35 min

Turn evals into guardrails without making every deploy hostage to flakes.

RAG Grounding And Citations

~40 min

Evaluate grounded answers, citation faithfulness, and retrieval misses separately.

Safety Risk And Policy Evals

~35 min

Probe policy failures with proportional severity and honest limits.

Online Signals And Feedback

~40 min

Read production signals without mistaking clicks for quality.

Experiments, Canaries, And Rollouts

~40 min

Connect eval thresholds to staged rollout decisions.

Drift Monitoring And Refresh

~30 min

Detect when your once-useful eval stopped describing reality.

Cost, Latency, And Sampling

~35 min

Spend judge calls where they buy confidence, not vibes.

Eval Observability And Traces

~40 min

Debug failed evals with traces, spans, and reproducible artifacts.

Reporting And Decision Records

~30 min

Translate scores into launch decisions people can audit later.

Anti-Patterns And Failure Modes

~35 min

Name the traps before they show up wearing a KPI badge.

Capstone: Eval Operating System

~45 min

Assemble offline, online, human, and model judging into a usable loop.