Pairwise And Ranking Evals

The setup

Some tasks resist absolute scoring. Is answer A more helpful than answer B? Is the new summarizer less verbose but equally faithful? Pairwise evaluation can be easier for humans and model judges than assigning a standalone numeric score.

Picture this

[Figure: good, bad, and ugly paths for pairwise preference testing.]

Mental model

Pairwise evals answer relative preference, not absolute readiness. They are strong for comparing variants but, on their own, weak evidence of safety or correctness. A winner can still be unacceptable if both outputs hallucinate with style.

Good

The good version randomizes left/right order, hides variant identity, offers a tie option, reports confidence intervals, and samples across slices. It combines preference with hard checks for policy, grounding, and critical facts.
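
A minimal sketch of the blind, randomized presentation and the interval reporting described above. The function names, the "baseline"/"candidate" labels, and the choice to count a tie as half a win are illustrative assumptions, not a prescribed method. The interval is a plain normal approximation and needs at least two outcomes.

```python
import math
import random


def blind_pair(baseline, candidate, rng):
    """Return (left, right, candidate_side) with the candidate's side chosen at random."""
    if rng.random() < 0.5:
        return candidate, baseline, "left"
    return baseline, candidate, "right"


def decode_vote(vote, candidate_side):
    """Map a judge's "left" / "right" / "tie" vote back to a variant label."""
    if vote == "tie":
        return "tie"
    return "candidate" if vote == candidate_side else "baseline"


def win_rate_with_ci(outcomes, z=1.96):
    """Candidate win rate and a normal-approximation interval.

    Assumption: a tie counts as half a win for each side.
    """
    n = len(outcomes)
    scores = [1.0 if o == "candidate" else 0.5 if o == "tie" else 0.0 for o in outcomes]
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = z * math.sqrt(variance / n)
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)


# Usage: the judge only ever sees "left" and "right", never the variant names.
rng = random.Random(0)
left, right, side = blind_pair("old answer", "new answer", rng)
outcome = decode_vote("left", side)
```

Running the same calculation per slice, rather than once over the pooled data, keeps a large easy slice from hiding a loss on a small risky one.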

Bad

The bad version always shows the candidate on the left, counts wins, and celebrates. It never asks whether judges prefer longer answers or whether both options failed the core requirement. Sports, but make it evaluation.

Ugly

The ugly reality is fatigue. Human pairwise review gets expensive and inconsistent. Use active sampling: send close calls, high-risk slices, and judge-disagreement cases to humans instead of reviewing everything manually.
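
One way the routing rule above might look in code. The item fields (`judge_votes`, `slice`, `margin`), the slice names, and the 0.1 threshold are all hypothetical; a real pipeline would define its own.

```python
def human_review_reasons(item, high_risk_slices, close_margin=0.1):
    """Return the reasons an item should go to a human, or an empty list if none apply.

    item is assumed to be a dict with:
      judge_votes: list of "candidate" / "baseline" / "tie" votes from model judges
      slice:       name of the data slice the item belongs to
      margin:      signed preference strength in [-1, 1], near 0 for close calls
    """
    reasons = []
    if len(set(item["judge_votes"])) > 1:
        reasons.append("judge_disagreement")
    if item["slice"] in high_risk_slices:
        reasons.append("high_risk_slice")
    if abs(item["margin"]) < close_margin:
        reasons.append("close_call")
    return reasons


# Usage: only items with at least one reason enter the human queue.
items = [
    {"judge_votes": ["candidate", "candidate"], "slice": "smalltalk", "margin": 0.6},
    {"judge_votes": ["candidate", "baseline"], "slice": "medical", "margin": 0.05},
]
human_queue = [i for i in items if human_review_reasons(i, {"medical", "legal"})]
```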

Artifact to produce

Create a pairwise protocol: variants, blind presentation, randomization, tie handling, slice sampling, adjudication, and stop rule.
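
The protocol can be written down as a small, checkable record. The sketch below is one possible shape, not a standard schema: every field name is an assumption, and the stop rule is reduced to a simple comparison budget for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PairwiseProtocol:
    variants: tuple                      # e.g. ("baseline", "candidate")
    blind: bool = True                   # judges never see variant names
    randomize_order: bool = True         # left/right assigned at random per item
    allow_ties: bool = True
    samples_per_slice: dict = field(default_factory=dict)  # slice name -> item count
    adjudication: str = ""               # how judge disagreements are resolved
    max_comparisons: int = 0             # stop rule, here a plain budget cap

    def problems(self):
        """List the protocol elements that are missing or unsafe."""
        issues = []
        if len(self.variants) < 2:
            issues.append("need at least two variants")
        if not self.blind:
            issues.append("variant identity is visible to judges")
        if not self.randomize_order:
            issues.append("presentation order is not randomized")
        if not self.allow_ties:
            issues.append("ties are not allowed")
        if not self.samples_per_slice:
            issues.append("no slice sampling plan")
        if not self.adjudication:
            issues.append("no adjudication rule")
        if self.max_comparisons <= 0:
            issues.append("no stop rule")
        return issues


protocol = PairwiseProtocol(
    variants=("baseline", "candidate"),
    samples_per_slice={"billing": 50, "refunds": 50},
    adjudication="a third reviewer decides when two judges disagree",
    max_comparisons=500,
)
```

Calling `protocol.problems()` before any review starts gives an empty list for a complete protocol and a readable checklist for an incomplete one.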

Pairwise review

Question	Why it matters
Is presentation order randomized?	Position bias can manufacture winners.
Are ties allowed?	Forced choices exaggerate tiny differences.
What hard checks must both answers still pass?	Preferred is not the same as acceptable.
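
The last row can be enforced mechanically by gating the preference on hard-check results. This is a sketch under assumed names: `final_verdict`, the variant labels, and the "reject_both" outcome are illustrative, and the hard checks themselves (policy, grounding, critical facts) are taken as already computed.

```python
def final_verdict(preferred, hard_check_passed):
    """Combine a pairwise preference with per-variant hard-check results.

    preferred:         "candidate", "baseline", or "tie"
    hard_check_passed: dict mapping "baseline" and "candidate" to True or False
    """
    acceptable = [v for v in ("baseline", "candidate") if hard_check_passed[v]]
    if not acceptable:
        return "reject_both"   # a preferred answer that fails is still a failure
    if len(acceptable) == 1:
        return acceptable[0]   # the only acceptable variant wins regardless of preference
    return preferred           # both pass, so the preference decides


# Usage: the judge liked the candidate, but it failed a grounding check.
result = final_verdict("candidate", {"baseline": True, "candidate": False})
```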

Chapter takeaway

Pairwise evals are great for choosing between candidates. They are terrible at proving both candidates are not nonsense.

References

Chatbot Arena paper
