Eval Writing Labs
Four offline arcs: rubric scoring, pairwise preference, slice regressions, and rollout gates. Each is Python-only, runnable with uv or pytest, and test-backed so the happy path cannot cosplay as evidence.
Rubric Smoke Test
lab-01-rubric-smoke-test
# Rubric Smoke Test
Score an answer with a tiny rubric and reject missing evidence.
## Good / Bad / Ugly
- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
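
As a rough illustration of "score an answer with a tiny rubric and reject missing evidence", here is a minimal sketch. The `Criterion` class, the keyword-matching rule, and the `RUBRIC` contents are all hypothetical and are not taken from the lab's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    keyword: str  # assumption: a criterion is met when this keyword appears in the answer
    points: int


def score_answer(answer: str, evidence: list[str], rubric: list[Criterion]) -> int:
    """Return the rubric score, refusing to score when no evidence is supplied."""
    # Reject empty or whitespace-only evidence before any scoring happens.
    if not any(item.strip() for item in evidence):
        raise ValueError("missing evidence: refusing to score")
    text = answer.lower()
    return sum(c.points for c in rubric if c.keyword.lower() in text)


# Hypothetical two-item rubric used only for this example.
RUBRIC = [
    Criterion("names the metric", "accuracy", 2),
    Criterion("states the baseline", "baseline", 1),
]
```

Raising on missing evidence, rather than returning a score of zero, keeps "unscored" distinct from "scored badly", which is one way to make the failure visible in a test.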
Pairwise Preference
lab-02-pairwise-preference
# Pairwise Preference
Choose the better output using deterministic criteria.
## Good / Bad / Ugly
- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
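
One way to make a pairwise choice deterministic is to reduce each output to a comparable key and compare the keys. The sketch below assumes two illustrative criteria (coverage of required terms, then brevity); the function name `prefer` and both criteria are hypothetical, not the lab's own.

```python
def prefer(a: str, b: str, required_terms: tuple[str, ...]) -> str:
    """Return "A", "B", or "tie" using fixed criteria with no randomness."""

    def key(text: str) -> tuple[int, int]:
        lowered = text.lower()
        covered = sum(term.lower() in lowered for term in required_terms)
        # Assumption: more covered terms wins; among equals, the shorter output wins.
        return (covered, -len(text))

    key_a, key_b = key(a), key(b)
    if key_a == key_b:
        return "tie"
    return "A" if key_a > key_b else "B"
```

Because the verdict depends only on the two keys, swapping the arguments flips "A" and "B" and never changes which text wins, which gives a test an easy position-bias check.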
Regression Slices
lab-03-regression-slices
# Regression Slices
Summarize pass rates by slice so aggregate scores cannot hide failures.
## Good / Bad / Ugly
- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
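
To show how an aggregate score can hide a failing slice, here is a small sketch that groups results by slice name. The `(slice_name, passed)` input shape and the `RESULTS` data are assumptions made up for this example.

```python
from collections import defaultdict


def pass_rates_by_slice(results):
    """Map each slice name to its pass rate, given (slice_name, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    # Sort names so the summary is stable across runs.
    return {name: passes[name] / totals[name] for name in sorted(totals)}


# Hypothetical data: 9 of 10 cases pass overall, but every "long" case fails.
RESULTS = [("short", True)] * 9 + [("long", False)]
```

Here the overall pass rate is 0.9 while the "long" slice sits at 0.0, so a regression check that asserts on each slice rather than on the aggregate would catch it.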
Rollout Gate
lab-04-rollout-gate
# Rollout Gate
Combine offline and online signals into a launch recommendation.
## Good / Bad / Ugly
- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
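
A launch recommendation that combines offline and online signals might be sketched as a small pure function. The signal names, the three verdicts, and every threshold below are hypothetical placeholders, not values from the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    offline_pass_rate: float      # fraction of offline eval cases that passed
    worst_slice_pass_rate: float  # lowest pass rate across slices
    online_error_rate: float      # assumption: error rate observed in a limited rollout


def recommend(
    s: Signals,
    min_offline: float = 0.95,       # hypothetical threshold
    min_slice: float = 0.90,         # hypothetical threshold
    max_online_error: float = 0.01,  # hypothetical threshold
) -> str:
    """Return "launch", "hold", or "block" from fixed thresholds."""
    # Offline failures block outright; online signals cannot override them.
    if s.offline_pass_rate < min_offline or s.worst_slice_pass_rate < min_slice:
        return "block"
    # Offline checks passed, but the online signal is not yet healthy.
    if s.online_error_rate > max_online_error:
        return "hold"
    return "launch"
```

Checking the offline signals first is a design choice in this sketch: a strong online number never rescues a failing offline eval, so the gate stays conservative.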