MCP Mastery
Chapter 5
mid
~35 min

Rubrics And Human Judgment

Make human labels repeatable enough to argue with productively.

Eval Toolkit 2026.05
Observability trace-first
Python 3.11
Reviewed 2026-05-17

Reading this chapter helps prevent five common eval-writing mistakes.

The setup

Rubrics translate fuzzy expectations into reviewable criteria. They do not remove judgment; they make judgment explicit enough to compare. A good rubric says what evidence earns a score and what failure forces a lower score.

Picture this

[Figure: Good, bad, and ugly paths for rubric calibration.]

Mental model

Use levels that a reviewer can distinguish from the artifact alone. "Excellent" and "pretty good" are vibes. "Cites every factual claim using retrieved sources" is a criterion. Include counterexamples so reviewers know where the boundary lives.
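As a rough illustration of a criterion that can be judged from the artifact alone, the citation criterion above can be written as a check that a reviewer could also follow by hand. This is only a sketch: the `claims` structure, its `sources` key, and the function name are hypothetical, and it assumes that "using retrieved sources" means each claim cites at least one source that was actually retrieved.

```python
def cites_every_claim(claims: list[dict], retrieved_ids: set[str]) -> bool:
    """True when every factual claim cites at least one retrieved source.

    Assumed shape (hypothetical): each claim is {"text": str, "sources": [str, ...]}.
    """
    return all(
        any(source in retrieved_ids for source in claim["sources"])
        for claim in claims
    )


retrieved = {"doc-1", "doc-2"}

# Example that meets the criterion.
passing = [
    {"text": "The API is rate limited.", "sources": ["doc-1"]},
    {"text": "Limits reset hourly.", "sources": ["doc-2"]},
]
# Counterexample: the second claim cites a source that was never retrieved.
failing = [
    {"text": "The API is rate limited.", "sources": ["doc-1"]},
    {"text": "Limits reset hourly.", "sources": ["doc-9"]},
]

print(cites_every_claim(passing, retrieved))
print(cites_every_claim(failing, retrieved))
```

Pairing the passing example with a near-miss counterexample is what shows a reviewer where the boundary sits.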

Good

The good version starts with calibration. Reviewers label the same examples, discuss disagreements, update rubric wording, and only then label at scale. Disagreement rate becomes a health metric for the rubric.
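One simple way to track disagreement as a health metric is the share of reviewer pairs whose scores differ on the same example. The sketch below assumes integer scores and a small calibration set. The function names, the example IDs, and the `REVISION_THRESHOLD` value are illustrative assumptions, not part of any toolkit. Chance-corrected statistics such as Cohen's kappa are a common alternative when label distributions are skewed.

```python
from itertools import combinations


def disagreement_rate(labels_by_item: dict[str, list[int]]) -> float:
    """Share of reviewer pairs, pooled across items, whose scores differ."""
    differing = 0
    total = 0
    for scores in labels_by_item.values():
        for first, second in combinations(scores, 2):
            total += 1
            differing += first != second
    return differing / total if total else 0.0


def items_to_discuss(labels_by_item: dict[str, list[int]]) -> list[str]:
    """Items where reviewers did not all give the same score."""
    return [item for item, scores in labels_by_item.items() if len(set(scores)) > 1]


# Hypothetical calibration round: three reviewers score the same three examples.
calibration_round = {
    "ex-01": [3, 3, 3],
    "ex-02": [2, 3, 1],
    "ex-03": [4, 4, 3],
}

REVISION_THRESHOLD = 0.25  # assumed cutoff; pick one that fits your rubric

rate = disagreement_rate(calibration_round)
print(f"disagreement rate: {rate:.2f}")
print("discuss:", items_to_discuss(calibration_round))
print("revise rubric wording:", rate > REVISION_THRESHOLD)
```

The per-item list matters as much as the rate: it tells the calibration session which examples to argue about.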

Bad

The bad version sends a vague spreadsheet to three reviewers and averages the results. When scores diverge, it blames the reviewers instead of the rubric, because accountability is apparently optional if the column has decimals.
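A small sketch of why averaging alone is a problem: two examples can share a mean while one of them hides a full-scale disagreement. The scores and the `summarize` helper here are made up for illustration.

```python
from statistics import mean


def summarize(scores: list[int]) -> dict[str, float]:
    """Report the mean alongside the spread (max minus min) the mean conceals."""
    return {"mean": mean(scores), "spread": max(scores) - min(scores)}


consistent = summarize([3, 3, 3])  # reviewers agree
divergent = summarize([1, 3, 5])   # reviewers read the rubric three different ways

print(consistent)
print(divergent)
```

Both rows average to the same score, so a spreadsheet that stores only the mean cannot tell them apart. Keeping the spread, or the raw labels, preserves the signal that the rubric wording needs work.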

Ugly

The ugly reality is that expert time is scarce. Use human review where it anchors model judges, resolves high-risk disagreements, or refreshes gold labels. Do not burn experts on examples automation can safely triage.

Artifact to produce

Create a rubric packet: criteria, score levels, examples, counterexamples, reviewer instructions, and calibration notes.
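One possible shape for the packet is a small set of dataclasses with a completeness check. The class names, field names, and the `gaps` method are hypothetical. They mirror the list above rather than any existing library.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreLevel:
    score: int
    evidence: str  # what evidence earns this score
    examples: list[str] = field(default_factory=list)
    counterexamples: list[str] = field(default_factory=list)


@dataclass
class Criterion:
    name: str
    levels: list[ScoreLevel]


@dataclass
class RubricPacket:
    criteria: list[Criterion]
    reviewer_instructions: str
    calibration_notes: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List the missing pieces that would leave reviewers guessing."""
        problems: list[str] = []
        if not self.reviewer_instructions.strip():
            problems.append("missing reviewer instructions")
        for criterion in self.criteria:
            for level in criterion.levels:
                label = f"{criterion.name} / score {level.score}"
                if not level.examples:
                    problems.append(f"{label}: no example")
                if not level.counterexamples:
                    problems.append(f"{label}: no counterexample")
        return problems


packet = RubricPacket(
    criteria=[
        Criterion(
            name="citations",
            levels=[
                ScoreLevel(
                    score=2,
                    evidence="Cites every factual claim using retrieved sources",
                    examples=["answer-017"],
                    counterexamples=["answer-022"],
                ),
                ScoreLevel(
                    score=1,
                    evidence="Cites some factual claims using retrieved sources",
                    examples=["answer-031"],
                ),
            ],
        )
    ],
    reviewer_instructions="Score from the artifact alone; do not look up outside sources.",
)

print(packet.gaps())
```

Running the check before a calibration round is a cheap way to catch a score level that has no boundary example yet.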

Rubric review

Question | Why it matters
Can reviewers apply each criterion from the artifact alone? | Observable criteria reduce vibes.
What disagreement rate triggers rubric revision? | Disagreement is diagnostic, not embarrassing.
Which examples define score boundaries? | Boundary examples make calibration real.

Chapter takeaway

A rubric should make disagreement useful. If it only makes disagreement quieter, congratulations on inventing bureaucracy.

Quiz

  1. What is reviewer disagreement usually telling you?

  2. Which is the bad version of rubric calibration?

  3. What should the ugly reality change about your process?