LLM-As-Judge Protocols

The setup

LLM judges are useful because they scale judgment-like review. They are dangerous because they sound authoritative while remaining sensitive to prompt wording, answer order, and verbosity, and because they tend to favor outputs from their own model family.

Picture this

Figure: good, bad, and ugly paths for model-judge calibration.

Mental model

A judge is another model component. It needs a prompt, examples, versioning, calibration data, and monitoring. If you would not deploy the production model without evals, do not deploy the judge without evals either. Yes, evals for evals. The recursion is annoying because it is necessary.
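One way to make "versioning" concrete is to stamp every judge score with the model identifier and a fingerprint of the exact prompt text. The sketch below is only an illustration: `JudgeScore`, `prompt_fingerprint`, and `record_score` are hypothetical names, and the 12-character SHA-256 prefix is an arbitrary choice.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScore:
    """One judge output plus the provenance needed to reproduce it (hypothetical schema)."""
    item_id: str
    score: int
    judge_model: str
    prompt_version: str


def prompt_fingerprint(prompt_text: str) -> str:
    # Any edit to the prompt text, even whitespace, yields a new fingerprint.
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def record_score(item_id: str, score: int, judge_model: str, prompt_text: str) -> JudgeScore:
    return JudgeScore(item_id, score, judge_model, prompt_fingerprint(prompt_text))
```

With records like this, scores produced under different prompts or models can be separated later instead of being silently averaged together.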

Good

The good version compares judge labels against human-labeled gold examples, reports agreement by slice (for example, by task type or answer length) rather than as a single average, randomizes answer order for pairwise checks, and audits disagreements. It records the judge prompt version and model version with every score.
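The comparison against gold labels can be sketched in a few lines. This is a minimal example, assuming each row is a dict with `slice`, `human`, and `judge` keys (a hypothetical schema). It reports raw agreement per slice and also Cohen's kappa, which is an addition here rather than something the chapter prescribes; kappa corrects raw agreement for the agreement expected by chance.

```python
from collections import defaultdict


def cohen_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Chance agreement: product of each rater's marginal frequency per label.
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_slice(rows):
    """rows: iterable of dicts with 'slice', 'human', 'judge' keys (assumed schema)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["slice"]].append((row["human"], row["judge"]))
    report = {}
    for name, pairs in grouped.items():
        human = [h for h, _ in pairs]
        judge = [j for _, j in pairs]
        report[name] = {
            "n": len(pairs),
            "agreement": sum(h == j for h, j in pairs) / len(pairs),
            "kappa": cohen_kappa(human, judge),
        }
    return report
```

A slice with high overall agreement but low kappa usually means the labels are imbalanced and the judge is mostly agreeing on the majority class, which is worth auditing by hand.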

Bad

The bad version asks "Which answer is better?" once, stores the result, and treats it as truth. It never checks whether the judge prefers longer answers or the first option. This is how automation becomes a confident intern with root access.
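Both shortcuts named above can be measured rather than guessed at. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns the string `"first"` or `"second"`; it asks every pair twice with the order swapped and uses character count as a crude stand-in for verbosity.

```python
def bias_report(judge, pairs):
    """pairs: list of (question, answer_a, answer_b) tuples.

    judge is an assumed callable: judge(question, first, second) -> 'first' | 'second'.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    first_slot_wins = flipped = decided = longer_wins = 0
    for question, a, b in pairs:
        forward = judge(question, a, b)    # answer a shown first
        backward = judge(question, b, a)   # answer b shown first
        first_slot_wins += (forward == "first") + (backward == "first")
        pick_forward = "a" if forward == "first" else "b"
        pick_backward = "b" if backward == "first" else "a"
        if pick_forward != pick_backward:
            flipped += 1                   # verdict depends on position alone
            continue
        if len(a) != len(b):
            decided += 1
            longer = "a" if len(a) > len(b) else "b"
            longer_wins += pick_forward == longer
    return {
        # Near 0.5 for a position-neutral judge.
        "first_slot_rate": first_slot_wins / (2 * len(pairs)),
        # Share of pairs whose winner changed when the order was swapped.
        "flip_rate": flipped / len(pairs),
        # Among order-stable verdicts, how often the longer answer won.
        "longer_win_rate": longer_wins / decided if decided else None,
    }
```

A high `longer_win_rate` is only a warning sign, since longer answers are sometimes genuinely better; comparing it against the same rate in the human gold labels is what separates bias from signal.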

Ugly

The ugly reality is drift. A vendor model update, a prompt tweak, or a new answer style can each shift judge behavior. Production eval systems need periodic recalibration against the gold set and alerting on shifts in the judge-score distribution.
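Alerting on distribution shift needs some number to alert on. One common choice, swapped in here as an assumption rather than taken from the chapter, is the population stability index (PSI) between a baseline window of judge scores and the current window. The 0.2 threshold is a widely quoted rule of thumb, not a calibrated value, and `drift_alert` is a hypothetical helper name.

```python
import math
from collections import Counter


def population_stability_index(baseline, current, categories, eps=1e-6):
    """PSI between two samples of discrete judge scores."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    base_counts = Counter(baseline)
    cur_counts = Counter(current)
    psi = 0.0
    for c in categories:
        # eps keeps empty bins from producing log(0) or division by zero.
        p = max(base_counts[c] / len(baseline), eps)
        q = max(cur_counts[c] / len(current), eps)
        psi += (q - p) * math.log(q / p)
    return psi


def drift_alert(baseline, current, categories, threshold=0.2):
    """Return the PSI and whether it crosses the (assumed) alert threshold."""
    psi = population_stability_index(baseline, current, categories)
    return {"psi": psi, "alert": psi > threshold}
```

A PSI alert says only that the score distribution moved. Whether the judge drifted or the answers really changed is a question for the gold set, which is why the alert should trigger recalibration rather than an automatic rollback.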

Artifact to produce

Build a judge card: task, rubric, judge prompt, model version, calibration set, agreement rate, known biases, and review escalation rules.
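The judge card can live as a small structured record next to the eval code so that it is versioned with it. The sketch below mirrors the fields listed above; the class name `JudgeCard`, the field types, and the JSON serialization are assumptions, and every value in the example is a placeholder rather than a real measurement.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class JudgeCard:
    task: str
    rubric: str
    judge_prompt: str
    model_version: str
    calibration_set: str
    agreement_rate: float
    known_biases: List[str] = field(default_factory=list)
    escalation_rules: List[str] = field(default_factory=list)

    def __post_init__(self):
        if not 0.0 <= self.agreement_rate <= 1.0:
            raise ValueError("agreement_rate must be between 0 and 1")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only.
example_card = JudgeCard(
    task="pairwise helpfulness comparison",
    rubric="prefer the answer that is correct, complete, and concise",
    judge_prompt="prompts/judge_v3.txt",
    model_version="example-judge-model",
    calibration_set="gold/helpfulness_v1.jsonl",
    agreement_rate=0.0,  # fill in from the calibration run
    known_biases=["position", "verbosity"],
    escalation_rules=["order-swap disagreement goes to a human"],
)
```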

Judge review

Question	Why it matters
What gold set calibrates the judge?	A judge without calibration is just another model opinion.
How are answer order and verbosity bias tested?	Judges love accidental shortcuts too.
When does human adjudication override the judge?	Automation needs an escalation boundary.
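The escalation boundary in the last question is easier to review when it is written down as code. The rules below are purely illustrative assumptions, not a recommended policy: `route`, the `margin` input (the gap between the two judge scores), and the default `min_margin` are all hypothetical.

```python
def route(verdict, margin, high_stakes, min_margin=1.0):
    """Decide who has the final say on one judged item.

    verdict: 'a', 'b', or 'inconsistent' (for example, after an order-swap check).
    margin: absolute gap between the judge's scores for the two answers.
    Returns (decider, reason), where decider is 'human' or 'judge'.
    """
    if high_stakes:
        return "human", "high-stakes item"
    if verdict == "inconsistent":
        return "human", "verdict changed when answer order was swapped"
    if margin < min_margin:
        return "human", "score margin below threshold"
    return "judge", "no escalation rule fired"
```

Returning the reason alongside the decision makes it possible to count how often each rule fires, which is itself a useful monitoring signal.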

Chapter takeaway

Treat the judge like production code. It has versions, failure modes, and a habit of sounding confident at the worst possible time.
