Eval Writing Labs

Four offline arcs: rubric scoring, pairwise preference, slice regressions, and rollout gates. Each is Python-only, runnable with uv or pytest, and test-backed so the happy path cannot cosplay as evidence.

Rubric Smoke Test

lab-01-rubric-smoke-test

# Rubric Smoke Test

Score an answer with a tiny rubric and reject missing evidence.

## Good / Bad / Ugly

- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
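To make "score an answer with a tiny rubric and reject missing evidence" concrete, here is a minimal sketch. It is not the lab's actual code: the `Criterion` class, the keyword-match check, the `score_answer` function, and the sample answer and evidence strings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line (hypothetical shape, not taken from the lab)."""
    name: str
    keyword: str  # assumed check: the answer must mention this term
    points: int


def score_answer(answer: str, evidence: list[str], rubric: list[Criterion]) -> dict:
    """Score an answer against the rubric; refuse to score without evidence."""
    cited = [item for item in evidence if item.strip()]
    if not cited:
        # Missing evidence is a rejection, not a zero that looks like a score.
        return {"status": "rejected", "reason": "missing evidence", "score": 0}

    text = answer.lower()
    earned = {
        c.name: (c.points if c.keyword.lower() in text else 0) for c in rubric
    }
    return {
        "status": "scored",
        "score": sum(earned.values()),
        "max_score": sum(c.points for c in rubric),
        "by_criterion": earned,
    }


# Illustrative rubric and inputs (made up for this sketch).
RUBRIC = [
    Criterion("names_metric", "accuracy", 2),
    Criterion("states_sample_size", "n=", 1),
]

result = score_answer("Accuracy rose to 0.91 (n=500).", ["run-17 eval log"], RUBRIC)
rejected = score_answer("Accuracy rose to 0.91 (n=500).", [], RUBRIC)

print(result["status"], result["score"])  # scored 3
print(rejected["status"])                 # rejected
```

Returning a distinct `rejected` status, rather than a score of zero alone, keeps an unsupported answer from being averaged in with answers that were actually graded.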

Pairwise Preference

lab-02-pairwise-preference

# Pairwise Preference

Choose the better output using deterministic criteria.

## Good / Bad / Ugly

- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
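One way to choose between two outputs deterministically is to reduce each to a comparable key and compare the keys. The sketch below assumes two criteria that are not stated in the lab: coverage of required terms first, then brevity as a tiebreaker. The names `criteria_key` and `prefer` and the sample outputs are hypothetical.

```python
def criteria_key(output: str, required_terms: tuple[str, ...]) -> tuple[int, int]:
    """Build a sortable key: more required terms covered, then fewer words."""
    text = output.lower()
    covered = sum(term.lower() in text for term in required_terms)
    # Negate the word count so that a shorter output compares as larger.
    return (covered, -len(output.split()))


def prefer(a: str, b: str, required_terms: tuple[str, ...]) -> str:
    """Return "a", "b", or "tie" with no randomness and no hidden state."""
    key_a = criteria_key(a, required_terms)
    key_b = criteria_key(b, required_terms)
    if key_a == key_b:
        return "tie"
    return "a" if key_a > key_b else "b"


# Illustrative inputs (made up for this sketch).
TERMS = ("timeout", "retry")
full = "Set a timeout and add a retry."
partial = "Set a timeout."
concise = "Use timeout plus retry."

print(prefer(full, partial, TERMS))  # a
print(prefer(full, concise, TERMS))  # b
print(prefer(full, full, TERMS))     # tie
```

Because the verdict depends only on the two keys, swapping the argument order flips "a" and "b" but never changes which output wins, which is a property a test can assert directly.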

Regression Slices

lab-03-regression-slices

# Regression Slices

Summarize pass rates by slice so aggregate scores cannot hide failures.

## Good / Bad / Ugly

- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
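The point about aggregates hiding failures is easy to see with numbers. The following sketch assumes results arrive as `(slice_name, passed)` pairs and uses a hypothetical threshold of 0.8; the function names and the sample data are illustrative and not taken from the lab.

```python
from collections import defaultdict


def pass_rates_by_slice(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Group (slice_name, passed) pairs and compute a pass rate per slice."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    # Sort by slice name so the summary is stable across runs.
    return {name: p / n for name, (p, n) in sorted(totals.items())}


def failing_slices(rates: dict[str, float], threshold: float) -> list[str]:
    """List the slices whose pass rate falls below the threshold."""
    return [name for name, rate in rates.items() if rate < threshold]


# Illustrative data: a large healthy slice and a small broken one.
RESULTS = (
    [("short", True)] * 9
    + [("short", False)]
    + [("long", True), ("long", False)]
)

aggregate = sum(passed for _, passed in RESULTS) / len(RESULTS)
rates = pass_rates_by_slice(RESULTS)

print(round(aggregate, 3))         # 0.833
print(rates)                       # {'long': 0.5, 'short': 0.9}
print(failing_slices(rates, 0.8))  # ['long']
```

Here the aggregate clears the 0.8 bar while the `long` slice sits at 0.5, so a gate that checks only the aggregate would pass a change that fails half of one slice.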

Rollout Gate

lab-04-rollout-gate

# Rollout Gate

Combine offline and online signals into a launch recommendation.

## Good / Bad / Ugly

- **Good**: deterministic inputs, explicit expected behavior, and a result a reviewer can reproduce
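A launch recommendation can be sketched as a pure function from signals to a decision plus the reasons behind it. Everything specific below is an assumption for illustration: the three signal fields, the threshold defaults, and the "launch" and "hold" labels are not taken from the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Inputs to the gate (hypothetical fields)."""
    offline_pass_rate: float      # overall offline eval pass rate
    worst_slice_pass_rate: float  # lowest per-slice offline pass rate
    online_error_rate: float      # error rate observed in a limited rollout


def recommend(
    signals: Signals,
    *,
    min_offline: float = 0.90,        # assumed threshold
    min_slice: float = 0.80,          # assumed threshold
    max_online_errors: float = 0.02,  # assumed threshold
) -> tuple[str, list[str]]:
    """Return ("launch", []) or ("hold", [reasons]); every blocker is listed."""
    blockers = []
    if signals.offline_pass_rate < min_offline:
        blockers.append("offline pass rate below minimum")
    if signals.worst_slice_pass_rate < min_slice:
        blockers.append("worst slice below minimum")
    if signals.online_error_rate > max_online_errors:
        blockers.append("online error rate above maximum")
    return ("hold", blockers) if blockers else ("launch", [])


healthy = recommend(Signals(0.95, 0.85, 0.01))
risky = recommend(Signals(0.95, 0.60, 0.05))

print(healthy)   # ('launch', [])
print(risky[0])  # hold
print(risky[1])  # ['worst slice below minimum', 'online error rate above maximum']
```

Collecting every blocker instead of stopping at the first one gives a reviewer the full picture in a single run, and keeping the thresholds as keyword arguments makes each test state the bar it is checking against.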