Rubrics And Human Judgment

The setup

Rubrics translate fuzzy expectations into reviewable criteria. They do not remove judgment; they make judgment explicit enough to compare. A good rubric says what evidence earns each score and which failures force a lower one.

Picture this

[Figure: the good, bad, and ugly paths for rubric calibration.]

Mental model

Define score levels that a reviewer can distinguish from the artifact alone. "Excellent" and "pretty good" are vibes. "Cites every factual claim using retrieved sources" is a criterion. Include counterexamples for each level so reviewers know where the boundary lives.
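One way to keep criteria observable is to store each score level next to the evidence that earns it and the counterexamples that mark its boundary. The sketch below is a minimal illustration, not a prescribed schema: the class names, the `undefined_boundaries` helper, and every level description other than the citation criterion quoted above are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreLevel:
    score: int
    evidence: str  # what a reviewer must be able to see in the artifact
    counterexamples: list = field(default_factory=list)  # near misses that do NOT earn this score


@dataclass
class Criterion:
    name: str
    levels: list

    def undefined_boundaries(self):
        """Return the scores of levels that have no counterexample yet."""
        return [lvl.score for lvl in self.levels if not lvl.counterexamples]


# Only the score-2 evidence text comes from the chapter; the rest is made up for illustration.
citation = Criterion(
    name="citation",
    levels=[
        ScoreLevel(
            score=2,
            evidence="Cites every factual claim using retrieved sources",
            counterexamples=["Cites a source that was never retrieved"],
        ),
        ScoreLevel(
            score=1,
            evidence="Cites some factual claims; at least one claim is uncited",
        ),
        ScoreLevel(
            score=0,
            evidence="Contains factual claims and no citations",
            counterexamples=["Contains no factual claims at all (criterion not applicable)"],
        ),
    ],
)

# Levels still missing a boundary example are the ones reviewers will argue about.
missing = citation.undefined_boundaries()
print(missing)
```

A check like this is cheap to run before a calibration round: any level it flags is a boundary the rubric has not drawn yet.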

Good

The good version starts with calibration. Reviewers label the same examples, discuss disagreements, update rubric wording, and only then label at scale. Disagreement rate becomes a health metric for the rubric.
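To treat disagreement as a metric, it has to be computed somehow. The sketch below uses one simple definition: the fraction of reviewer pairs, per example, whose scores differ. The function name, the sample labels, and the 0.25 revision threshold are all assumptions for illustration; a team might prefer a chance-corrected agreement statistic instead.

```python
from itertools import combinations


def disagreement_rate(labels):
    """Fraction of (example, reviewer pair) comparisons where the two scores differ.

    `labels` maps an example id to the list of scores reviewers gave it.
    """
    differing = 0
    total = 0
    for scores in labels.values():
        for a, b in combinations(scores, 2):
            total += 1
            if a != b:
                differing += 1
    return differing / total if total else 0.0


# Hypothetical calibration round: three reviewers, four shared examples.
calibration_round = {
    "ex1": [2, 2, 2],
    "ex2": [2, 1, 2],
    "ex3": [0, 1, 2],
    "ex4": [1, 1, 1],
}

REVISION_THRESHOLD = 0.25  # assumed cutoff; pick one that fits your rubric

rate = disagreement_rate(calibration_round)
needs_revision = rate > REVISION_THRESHOLD
print(f"disagreement rate: {rate:.2f}, revise rubric: {needs_revision}")
```

Here 5 of the 12 pairwise comparisons differ, so the rate is about 0.42 and the rubric wording goes back for revision before anyone labels at scale.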

Bad

The bad version sends a vague spreadsheet to three reviewers and averages the results. When scores diverge, it blames the reviewers instead of the rubric, because accountability is apparently optional if the column has decimals.
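A small worked example shows what averaging throws away. The scores below are invented, and `summarize` is a hypothetical helper; the point is only that two very different review outcomes can share the same mean.

```python
from statistics import mean


def summarize(scores):
    """Report the mean alongside the spread that the mean hides."""
    return {"mean": mean(scores), "spread": max(scores) - min(scores)}


consistent = [3, 3, 3]  # three reviewers agree
divergent = [1, 3, 5]   # three reviewers read the rubric three different ways

agreeing = summarize(consistent)
arguing = summarize(divergent)
print(agreeing)
print(arguing)
```

Both rows average to 3, but the spread is 0 in one and 4 in the other. A spreadsheet that keeps only the mean cannot tell them apart, which is how the rubric escapes blame.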

Ugly

The ugly reality is that expert time is scarce. Use human review where it anchors model judges, resolves high-risk disagreements, or refreshes gold labels. Do not burn experts on examples automation can safely triage.
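The three uses of expert time named above can be written down as a routing rule. This is a sketch under stated assumptions, not a recommended policy: the `Example` fields, the `expert_reason` function, the anchor-sample flag, and the 90-day gold-label age limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    example_id: str
    high_risk: bool
    judge_scores: List[int]             # scores from automated (model) judges
    in_anchor_sample: bool = False      # sampled to check model judges against humans
    gold_age_days: Optional[int] = None # age of its gold label, if it has one


def expert_reason(ex, max_gold_age_days=90):
    """Return why an example needs expert review, or None if automation can triage it."""
    if ex.in_anchor_sample:
        return "anchor model judges"
    judges_disagree = len(set(ex.judge_scores)) > 1
    if ex.high_risk and judges_disagree:
        return "resolve high-risk disagreement"
    if ex.gold_age_days is not None and ex.gold_age_days > max_gold_age_days:
        return "refresh gold label"
    return None


queue = [
    Example("a", high_risk=True, judge_scores=[1, 3]),
    Example("b", high_risk=False, judge_scores=[1, 3]),
    Example("c", high_risk=False, judge_scores=[2, 2], gold_age_days=200),
    Example("d", high_risk=False, judge_scores=[2, 2], in_anchor_sample=True),
]

routing = {ex.example_id: expert_reason(ex) for ex in queue}
print(routing)
```

Example "b" has disagreeing judges but is low risk, so under these assumptions it stays with automation and the expert's hour goes elsewhere.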

Artifact to produce

Create a rubric packet: criteria, score levels, examples, counterexamples, reviewer instructions, and calibration notes.

Rubric review

Question	Why it matters
Can reviewers apply each criterion from the artifact alone?	Observable criteria reduce vibes.
What disagreement rate triggers rubric revision?	Disagreement is diagnostic, not embarrassing.
Which examples define score boundaries?	Boundary examples make calibration real.

Chapter takeaway

A rubric should make disagreement useful. If it only makes disagreement quieter, congratulations on inventing bureaucracy.

References

Hamel Husain on evals
