The setup
Rubrics translate fuzzy expectations into reviewable criteria. They do not remove judgment; they make judgment explicit enough to compare. A good rubric says what evidence earns a score and what failure forces a lower score.
Picture this
Mental model
Use levels that a reviewer can distinguish from the artifact alone. "Excellent" and "pretty good" are vibes. "Cites every factual claim using retrieved sources" is a criterion. Include counterexamples so reviewers know where the boundary lives.
Good
The good version starts with calibration. Reviewers label the same examples, discuss disagreements, update rubric wording, and only then label at scale. Disagreement rate becomes a health metric for the rubric.
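One way to turn disagreement into a number is the share of reviewer pairs that scored the same example differently. The sketch below is a minimal illustration, not a prescribed metric: the function name `disagreement_rate`, the example ids, and the scores are all hypothetical, and a chance-corrected agreement statistic may suit a real project better.

```python
from itertools import combinations


def disagreement_rate(labels_by_item):
    """Fraction of reviewer pairs that gave different scores, pooled over items.

    labels_by_item maps an example id to the list of scores its reviewers assigned.
    """
    disagreeing = 0
    total = 0
    for scores in labels_by_item.values():
        # Compare every pair of reviewers who labeled this example.
        for a, b in combinations(scores, 2):
            total += 1
            if a != b:
                disagreeing += 1
    if total == 0:
        raise ValueError("need at least one example with two or more labels")
    return disagreeing / total


# Hypothetical calibration round: three reviewers, four shared examples.
calibration_round = {
    "ex-1": [3, 3, 3],
    "ex-2": [2, 3, 3],
    "ex-3": [1, 1, 2],
    "ex-4": [4, 4, 4],
}
rate = disagreement_rate(calibration_round)
print(f"disagreement rate: {rate:.2f}")
```

Recomputing the rate after each wording change shows whether the revision actually helped.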
Bad
The bad version sends a vague spreadsheet to three reviewers and averages the results. When scores diverge, it blames the reviewers instead of the rubric, because accountability is apparently optional if the column has decimals.
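A small illustration of why the average alone misleads: two examples can share a mean while one of them hides a full-range split. The `summarize` helper and the scores are made up for this sketch.

```python
from statistics import mean


def summarize(scores):
    """Report the mean alongside the spread the mean conceals."""
    return {"mean": mean(scores), "spread": max(scores) - min(scores)}


agreed = summarize([3, 3, 3])  # reviewers agree
split = summarize([1, 3, 5])   # reviewers read the rubric three different ways

print(agreed)
print(split)
```

Both rows average to the same score, so a spreadsheet that keeps only the mean cannot tell them apart. Keeping the spread, or the raw labels, next to the mean is what points the investigation at the rubric wording.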
Ugly
The ugly reality is that expert time is scarce. Use human review where it anchors model judges, resolves high-risk disagreements, or refreshes gold labels. Do not burn experts on examples automation can safely triage.
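The routing rule above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Example` fields, the idea that automated judges emit comparable scores, and the returned labels. A real pipeline would define risk and staleness in its own terms.

```python
from dataclasses import dataclass


@dataclass
class Example:
    example_id: str
    judge_scores: list              # scores from automated judges (hypothetical)
    high_risk: bool = False
    in_anchor_sample: bool = False  # sampled to check model judges against humans
    gold_label_stale: bool = False


def route(example):
    """Send an example to an expert only for the three reasons named above."""
    if example.in_anchor_sample:
        return "human: anchor model judges"
    judges_disagree = len(set(example.judge_scores)) > 1
    if example.high_risk and judges_disagree:
        return "human: high-risk disagreement"
    if example.gold_label_stale:
        return "human: refresh gold label"
    # Everything else stays with automated triage.
    return "automated"


queue = [
    Example("ex-1", [3, 3]),
    Example("ex-2", [2, 4], high_risk=True),
    Example("ex-3", [4, 4], gold_label_stale=True),
]
for ex in queue:
    print(ex.example_id, "->", route(ex))
```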
Artifact to produce
Create a rubric packet: criteria, score levels, examples, counterexamples, reviewer instructions, and calibration notes.
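If the packet lives in code rather than a document, one possible shape is sketched below. The class names, field names, example ids, and the `gaps` check are hypothetical. Only the list of packet contents comes from the text above, and the level-1 wording is invented to give the example a second level.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreLevel:
    score: int
    evidence: str  # what evidence earns this score
    examples: list = field(default_factory=list)
    counterexamples: list = field(default_factory=list)


@dataclass
class Criterion:
    name: str
    levels: list


@dataclass
class RubricPacket:
    criteria: list
    reviewer_instructions: str
    calibration_notes: list = field(default_factory=list)

    def gaps(self):
        """List what is still missing before the packet goes to reviewers."""
        problems = []
        if not self.reviewer_instructions.strip():
            problems.append("no reviewer instructions")
        if not self.calibration_notes:
            problems.append("no calibration notes")
        for criterion in self.criteria:
            for level in criterion.levels:
                label = f"{criterion.name} / score {level.score}"
                if not level.examples:
                    problems.append(f"{label}: no example")
                if not level.counterexamples:
                    problems.append(f"{label}: no counterexample")
        return problems


packet = RubricPacket(
    criteria=[Criterion("citations", [
        ScoreLevel(2, "Cites every factual claim using retrieved sources",
                   examples=["answer-017"], counterexamples=["answer-042"]),
        ScoreLevel(1, "Cites some factual claims", examples=["answer-023"]),
    ])],
    reviewer_instructions="Score from the artifact alone.",
)
for problem in packet.gaps():
    print(problem)
```

A completeness check like `gaps` is a cheap gate: a packet with missing boundary examples is not ready for calibration.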
Rubric review
| Question | Why it matters |
|---|---|
| Can reviewers apply each criterion from the artifact alone? | Observable criteria reduce vibes. |
| What disagreement rate triggers rubric revision? | Disagreement is diagnostic, not embarrassing. |
| Which examples define score boundaries? | Boundary examples make calibration real. |
Chapter takeaway
A rubric should make disagreement useful. If it only makes disagreement quieter, congratulations on inventing bureaucracy.