The setup
Some tasks resist absolute scoring. Is answer A more helpful than answer B? Is the new summarizer less verbose but equally faithful? Pairwise evaluation can be easier for humans and model judges than assigning a standalone numeric score.
Mental model
Pairwise evals answer relative preference, not absolute readiness. They are strong for comparing variants and weak, on their own, for proving safety or correctness. A winner can still be unacceptable if both outputs hallucinate with style.
Good
The good version randomizes left/right order, hides variant identity, uses tie options, reports confidence intervals, and samples by slice. It combines preference with hard checks for policy, grounding, and critical facts.
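As a rough illustration of the randomization, blinding, tie handling, and interval reporting described above, here is a minimal Python sketch. The names `BlindPair`, `make_blind_pair`, `decode_verdict`, `wilson_interval`, and `summarize` are hypothetical, and the Wilson score interval is one reasonable choice of confidence interval, not something this chapter prescribes.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class BlindPair:
    item_id: str
    left: str
    right: str
    candidate_side: str  # "left" or "right"; stored for scoring, never shown to the judge


def make_blind_pair(item_id, baseline_out, candidate_out, rng):
    """Flip a coin to decide which side the candidate output appears on."""
    if rng.random() < 0.5:
        return BlindPair(item_id, candidate_out, baseline_out, "left")
    return BlindPair(item_id, baseline_out, candidate_out, "right")


def decode_verdict(pair, verdict):
    """Map a judge verdict ("left", "right", or "tie") back to a variant name."""
    if verdict == "tie":
        return "tie"
    return "candidate" if verdict == pair.candidate_side else "baseline"


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def summarize(outcomes):
    """Candidate win rate over decisive (non-tie) comparisons, with an interval."""
    wins = outcomes.count("candidate")
    losses = outcomes.count("baseline")
    ties = outcomes.count("tie")
    decisive = wins + losses
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "win_rate": wins / decisive if decisive else None,
        "ci": wilson_interval(wins, decisive),
    }
```

Reporting ties separately, rather than splitting them between the variants, is an assumption of this sketch; it keeps "no real difference" visible in the summary. Running `summarize` once per slice gives the slice-level view.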
Bad
The bad version always shows the candidate on the left, counts wins, and celebrates. It never asks whether judges prefer longer answers or whether both options failed the core requirement. Sports, but make it evaluation.
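The two questions the bad version skips can be asked of the verdict log directly. The sketch below assumes each record is a dict with hypothetical keys `verdict`, `left_len`, and `right_len`; a left-pick rate or longer-pick rate far from 0.5 is a prompt to investigate, not proof of bias by itself.

```python
def bias_report(records):
    """Rough position-bias and length-bias diagnostics over judge verdicts.

    Each record is assumed to look like:
    {"verdict": "left" | "right" | "tie", "left_len": int, "right_len": int}
    """
    decisive = [r for r in records if r["verdict"] != "tie"]
    left_picks = sum(r["verdict"] == "left" for r in decisive)

    # Only pairs with different lengths can say anything about length preference.
    unequal = [r for r in decisive if r["left_len"] != r["right_len"]]
    longer_picks = sum(
        (r["verdict"] == "left") == (r["left_len"] > r["right_len"])
        for r in unequal
    )

    return {
        "left_pick_rate": left_picks / len(decisive) if decisive else None,
        "longer_pick_rate": longer_picks / len(unequal) if unequal else None,
    }
```

With randomized presentation order, the left-pick rate should hover near 0.5 regardless of which variant is better, which is what makes it a usable check.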
Ugly
The ugly reality is fatigue. Human pairwise review gets expensive and inconsistent. Use active sampling: send close calls, high-risk slices, and judge-disagreement cases to humans instead of reviewing everything manually.
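One way to sketch that routing rule in Python is shown below. The function name `needs_human`, the example slice names, and the `close_margin` threshold are all assumptions made for illustration; real thresholds would come from your own review budget and risk tolerance.

```python
from collections import Counter


def needs_human(votes, slice_name,
                high_risk_slices=frozenset({"medical", "legal"}),
                close_margin=0.2):
    """Return a reason string if the item should go to human review, else None.

    votes: decoded model-judge verdicts, each "candidate", "baseline", or "tie".
    high_risk_slices and close_margin are illustrative defaults, not recommendations.
    """
    if slice_name in high_risk_slices:
        return "high_risk_slice"
    if not votes:
        return "no_judge_votes"

    counts = Counter(votes)
    cand, base = counts["candidate"], counts["baseline"]

    # Judges picked opposite winners.
    if cand > 0 and base > 0:
        return "judge_disagreement"

    # Mostly ties or a thin margin: the judges could not separate the outputs.
    if abs(cand - base) / len(votes) <= close_margin:
        return "close_call"

    return None
```

Items that return `None` stay with the automated judges, so human attention goes only to the cases listed above.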
Artifact to produce
Create a pairwise protocol: variants, blind presentation, randomization, tie handling, slice sampling, adjudication, and stop rule.
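If it helps to keep the protocol next to the eval code, it can be written down as a small data structure. The sketch below is one possible shape, with hypothetical field names and placeholder wording for the adjudication and stop rules; the `problems` method only flags the omissions this chapter warns about.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PairwiseProtocol:
    variants: tuple                      # exactly two variant names
    blind: bool = True                   # judges never see variant identity
    randomize_order: bool = True         # left/right assigned by coin flip
    allow_ties: bool = True              # judges may answer "tie"
    slices: dict = field(default_factory=dict)  # slice name -> sample count
    adjudication: str = "a third reviewer resolves disagreements"  # placeholder
    stop_rule: str = "stop when the interval excludes 0.5 or the budget is spent"  # placeholder

    def problems(self):
        """List protocol gaps; an empty list means none of these checks fired."""
        issues = []
        if len(self.variants) != 2 or len(set(self.variants)) != 2:
            issues.append("need exactly two distinct variants")
        if not self.blind:
            issues.append("variant identity is visible to judges")
        if not self.randomize_order:
            issues.append("presentation order is not randomized")
        if not self.allow_ties:
            issues.append("ties are not allowed")
        if not self.slices:
            issues.append("no slice sampling plan")
        return issues
```

For example, `PairwiseProtocol(variants=("baseline", "candidate"), slices={"general": 200}).problems()` returns an empty list, while dropping the slice plan or the randomization produces a named issue.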
Pairwise review
| Question | Why it matters |
|---|---|
| Is presentation order randomized? | Position bias can manufacture winners. |
| Are ties allowed? | Forced choices exaggerate tiny differences. |
| What hard checks must both answers still pass? | Preferred is not the same as acceptable. |
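The last row of the table can be enforced mechanically by gating the preference on hard checks. This is a minimal sketch: `passes_all` and `gated_outcome` are hypothetical names, and each check is assumed to be a function that takes an output string and returns `True` or `False`.

```python
def passes_all(output, checks):
    """True only if every hard check accepts the output."""
    return all(check(output) for check in checks)


def gated_outcome(preference, baseline_out, candidate_out, checks):
    """Report the judge's preference only when both outputs pass the hard checks.

    preference: decoded verdict, "candidate", "baseline", or "tie".
    """
    baseline_ok = passes_all(baseline_out, checks)
    candidate_ok = passes_all(candidate_out, checks)
    if not baseline_ok and not candidate_ok:
        return "both_fail"
    if not candidate_ok:
        return "candidate_fails_hard_check"
    if not baseline_ok:
        return "baseline_fails_hard_check"
    return preference
```

Counting `both_fail` as its own outcome keeps a stylish but wrong winner from being recorded as a win.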
Chapter takeaway
Pairwise evals are great for choosing between candidates. They are terrible at proving both candidates are not nonsense.