The setup
LLM judges are useful because they scale judgment-like review. They are dangerous because they sound authoritative while being sensitive to prompt wording, answer order, verbosity, and their own model-family preferences.
Mental model
A judge is another model component. It needs a prompt, examples, versioning, calibration data, and monitoring. If you would not deploy the production model without evals, do not deploy the judge without evals either. Yes, evals for evals. The recursion is annoying because it is necessary.
Good
The good version compares judge labels against human-labeled gold examples, reports agreement by slice (for example, by task type or answer length), randomizes answer order for pairwise checks, and audits disagreements. It records the judge prompt version and model version alongside every score.
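As a rough illustration of "agreement by slice," here is a minimal sketch. The record schema (`slice`, `gold`, `judge` keys) and both function names are assumptions made for this example, not a prescribed format. Cohen's kappa is included as one common way to correct raw agreement for chance; the chapter does not mandate a specific statistic.

```python
from collections import defaultdict

def agreement_by_slice(records):
    """Raw judge-vs-gold agreement per slice.

    records: iterable of dicts with hypothetical keys
    'slice', 'gold', and 'judge'.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for r in records:
        totals[r["slice"]] += 1
        matches[r["slice"]] += int(r["gold"] == r["judge"])
    return {s: matches[s] / totals[s] for s in totals}

def cohen_kappa(gold, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(gold)
    labels = set(gold) | set(judge)
    observed = sum(g == j for g, j in zip(gold, judge)) / n
    # Agreement expected if both raters labeled independently
    # at their observed label frequencies.
    expected = sum((gold.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy data, invented for illustration only.
records = [
    {"slice": "code", "gold": "A", "judge": "A"},
    {"slice": "code", "gold": "B", "judge": "A"},
    {"slice": "chat", "gold": "A", "judge": "A"},
]
per_slice = agreement_by_slice(records)
kappa = cohen_kappa(["A", "B", "A", "B"], ["A", "B", "A", "A"])
```

A single overall agreement number can hide a slice where the judge is close to useless, which is why the per-slice breakdown comes first.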
Bad
The bad version asks "Which answer is better?" once, stores the result, and treats it as truth. It never checks whether the judge prefers longer answers or the first option. This is how automation becomes a confident intern with root access.
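The missing checks are cheap to sketch. The example below assumes a hypothetical `judge(prompt, first, second)` callable that returns `"first"` or `"second"`; a real judge would wrap a model call. It runs every pair in both orders and reports how often the verdict flips, how often the first slot wins, and how often the longer answer wins. Character length is a crude stand-in for verbosity, chosen here only for simplicity.

```python
def bias_report(pairs, judge):
    """pairs: list of (prompt, answer_a, answer_b) tuples.

    judge: hypothetical callable returning 'first' or 'second'.
    """
    inconsistent = 0
    first_picks = 0
    longer_picks = 0
    length_trials = 0
    for prompt, a, b in pairs:
        forward = judge(prompt, a, b)   # a shown first
        backward = judge(prompt, b, a)  # b shown first
        first_picks += (forward == "first") + (backward == "first")

        # Map each positional verdict back to the underlying answer.
        winner_fwd = "a" if forward == "first" else "b"
        winner_bwd = "b" if backward == "first" else "a"
        if winner_fwd != winner_bwd:
            inconsistent += 1

        # Verbosity check on the forward run, only when lengths differ.
        if len(a) != len(b):
            chosen, other = (a, b) if winner_fwd == "a" else (b, a)
            length_trials += 1
            longer_picks += int(len(chosen) > len(other))
    n = len(pairs)
    return {
        "inconsistency_rate": inconsistent / n,
        "first_position_rate": first_picks / (2 * n),
        "longer_pick_rate": longer_picks / length_trials if length_trials else None,
    }

# Stub judges that exhibit one bias each, for demonstration only.
def always_first(prompt, first, second):
    return "first"

def prefers_longer(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"

pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "another long answer here", "tiny"),
]
position_biased = bias_report(pairs, always_first)
verbosity_biased = bias_report(pairs, prefers_longer)
```

Note that the order swap alone does not catch the verbosity-biased stub: it is perfectly consistent across orders and still always picks the longer answer. The two checks measure different failure modes.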
Ugly
The ugly reality is drift. A vendor model update, prompt tweak, or new answer style can move judge behavior. Production eval systems need periodic recalibration and alerting on judge-score distribution shifts.
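One simple way to alert on a distribution shift is to compare the current window of judge scores against a baseline window. The sketch below uses total variation distance over discrete scores. The metric, the function names, and the `0.1` threshold are all assumptions picked for illustration; a production system might prefer a statistical test or a threshold tuned on its own history.

```python
from collections import Counter

def score_distribution(scores):
    """Map each discrete score to its relative frequency."""
    counts = Counter(scores)
    total = len(scores)
    return {s: c / total for s, c in counts.items()}

def total_variation_distance(baseline, current):
    """0.0 means identical distributions, 1.0 means no overlap."""
    p = score_distribution(baseline)
    q = score_distribution(current)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))

def drift_alert(baseline, current, threshold=0.1):
    """True when the shift exceeds the (assumed) alert threshold."""
    return total_variation_distance(baseline, current) > threshold

# Invented score windows on a 1-5 rubric scale.
baseline_scores = [1, 2, 3, 4, 5] * 20
stable_scores = [5, 4, 3, 2, 1] * 20
inflated_scores = [5] * 100

stable_distance = total_variation_distance(baseline_scores, stable_scores)
inflated_distance = total_variation_distance(baseline_scores, inflated_scores)
```

An alert here says only that the judge's outputs moved. Whether the judge drifted or the answers genuinely changed still has to be settled by re-running the calibration set.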
Artifact to produce
Build a judge card: task, rubric, judge prompt, model version, calibration set, agreement rate, known biases, and review escalation rules.
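A judge card can live as a small structured record checked in next to the judge prompt. The sketch below mirrors the fields listed above; the class name, field types, and every example value are hypothetical placeholders, not measurements or recommendations.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass(frozen=True)
class JudgeCard:
    task: str
    rubric: str
    judge_prompt: str
    model_version: str
    calibration_set: str
    agreement_rate: float
    known_biases: List[str] = field(default_factory=list)
    escalation_rules: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for storage alongside the scores the judge produces."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Every value below is a placeholder for illustration.
card = JudgeCard(
    task="pairwise preference on support replies",
    rubric="prefer the reply that resolves the issue accurately",
    judge_prompt="prompts/judge_v3.txt",
    model_version="example-model-2024-01",
    calibration_set="gold/support_pairs_v2.jsonl",
    agreement_rate=0.87,  # placeholder, not a real measurement
    known_biases=["favors first position", "favors longer answers"],
    escalation_rules=["send order-inconsistent verdicts to human review"],
)
card_json = card.to_json()
```

Freezing the dataclass is a deliberate choice in this sketch: a change to the prompt or model should produce a new card rather than silently mutate an old one.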
Judge review
| Question | Why it matters |
|---|---|
| What gold set calibrates the judge? | A judge without calibration is just another model opinion. |
| How are answer order and verbosity bias tested? | Judges love accidental shortcuts too. |
| When does human adjudication override the judge? | Automation needs an escalation boundary. |
Chapter takeaway
Treat the judge like production code. It has versions, failure modes, and a habit of sounding confident at the worst possible time.