The setup
Evaluation has a budget. Human review costs time, model judges cost money, and full-suite runs cost latency. The answer is not "run less." The answer is to spend evaluation effort where it changes decisions.
Mental model
Tier the suite: deterministic schema checks, cheap lexical or rule checks, sampled judge checks, human review for high-risk or disagreement cases, and full batch runs before major releases.
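One way to make the tiers concrete is to attach each one to the cheapest trigger at which it runs. The sketch below is illustrative only: the tier names, the `Trigger` enum, and the assignment of tiers to triggers are assumptions, not a prescribed schedule. In practice human review is usually an escalation path rather than a scheduled job, so its placement here is a simplification.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Trigger(IntEnum):
    """Hypothetical triggers, ordered from most to least frequent."""
    PER_COMMIT = 1
    NIGHTLY = 2
    PRE_RELEASE = 3


@dataclass(frozen=True)
class Tier:
    name: str
    min_trigger: Trigger  # most frequent trigger at which this tier runs


# Assumed mapping of tiers to triggers; adjust to your own cost and risk profile.
TIERS = [
    Tier("schema_checks", Trigger.PER_COMMIT),
    Tier("lexical_rule_checks", Trigger.PER_COMMIT),
    Tier("sampled_judge_checks", Trigger.NIGHTLY),
    Tier("human_review", Trigger.NIGHTLY),
    Tier("full_batch_run", Trigger.PRE_RELEASE),
]


def tiers_for(trigger: Trigger) -> List[str]:
    """Return every tier that should run at this trigger or a more frequent one."""
    return [t.name for t in TIERS if t.min_trigger <= trigger]


print(tiers_for(Trigger.PER_COMMIT))
print(tiers_for(Trigger.PRE_RELEASE))
```

Rarer triggers include everything the more frequent ones run, so a pre-release run is a superset of the nightly run in this sketch.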
Good
The good version samples by risk slice, runs heavyweight judges on uncertain or high-impact examples, and tracks cost per decision. It can explain why a test runs nightly instead of per commit.
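A rough sketch of that routing logic follows. Everything specific in it is an assumption made for illustration: the `risk` and `cheap_score` fields, the sampling rates, the 0.4 to 0.6 uncertainty band, and the definition of cost per decision as total eval spend divided by the number of decisions the results informed.

```python
import random

# Hypothetical sampling rates per risk slice.
SAMPLING_RATES = {"high": 1.0, "medium": 0.25, "low": 0.05}

# Hypothetical band in which a cheap check's score counts as "uncertain".
UNCERTAIN_LOW, UNCERTAIN_HIGH = 0.4, 0.6


def needs_judge(example: dict, rng: random.Random) -> bool:
    """Decide whether to spend a heavyweight judge call on this example."""
    if example["risk"] == "high":
        return True  # high-impact examples are always judged
    if UNCERTAIN_LOW <= example["cheap_score"] <= UNCERTAIN_HIGH:
        return True  # the cheap check could not decide
    # Otherwise fall back to random sampling at the slice's rate.
    return rng.random() < SAMPLING_RATES[example["risk"]]


def cost_per_decision(total_cost: float, decisions_informed: int) -> float:
    """Eval spend divided by the number of decisions it informed."""
    if decisions_informed == 0:
        return float("inf")  # spend that supports no decision
    return total_cost / decisions_informed


rng = random.Random(0)  # seeded so sampling is reproducible
print(needs_judge({"risk": "low", "cheap_score": 0.5}, rng))
print(cost_per_decision(120.0, 4))
```

Returning infinity when no decision was informed is a deliberate choice in this sketch: it makes spend with no attached decision stand out in any report that sorts by cost.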
Bad
The bad version either runs everything always or nothing until launch week. Both are expensive: one in compute, the other in incidents. Choose your invoice.
Ugly
The ugly reality is budget pressure. Teams may cut the most expensive evals first, even when those evals guard against the highest-risk failures. Keep risk and cost visible together.
Artifact to produce
Create an eval budget table: check, cost, runtime, trigger, risk covered, sampling rate, and owner.
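The table can live in a spreadsheet, but keeping it in code makes it easy to review alongside the suite. Below is a minimal sketch using the columns listed above. The `BudgetRow` type, the units (USD per run, minutes), and the two example rows are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class BudgetRow:
    check: str
    cost_usd_per_run: float  # assumed unit
    runtime_min: float       # assumed unit
    trigger: str
    risk_covered: str
    sampling_rate: float     # fraction of examples evaluated, 0.0 to 1.0
    owner: str


def to_markdown(rows: List[BudgetRow]) -> str:
    """Render the budget rows as a markdown table, one column per field."""
    headers = [f.name for f in fields(BudgetRow)]
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in rows:
        cells = [str(getattr(row, h)) for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder rows for illustration only.
ROWS = [
    BudgetRow("schema_check", 0.0, 0.5, "per PR", "malformed output", 1.0, "platform"),
    BudgetRow("llm_judge", 12.0, 20.0, "nightly", "unsafe advice", 0.25, "eval team"),
]

print(to_markdown(ROWS))
```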
Budget review
| Question | Why it matters |
|---|---|
| Which checks are cheap enough for every PR? | Fast checks protect developer flow. |
| Where is expensive judging concentrated? | Spend where uncertainty matters. |
| What risk would be exposed by reducing sample size? | Savings can hide the weakest slices. |
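The last question can be checked with simple arithmetic before cutting a sampling rate. The sketch below estimates how many examples each slice would keep and the chance of seeing no failures at all in a slice, assuming each sampled example fails independently at a fixed rate. The slice names, sizes, and thresholds are hypothetical.

```python
from typing import Dict, List


def expected_samples(slice_sizes: Dict[str, int], rate: float) -> Dict[str, float]:
    """Expected number of sampled examples per slice under uniform sampling."""
    return {name: size * rate for name, size in slice_sizes.items()}


def prob_zero_failures_seen(n_sampled: int, failure_rate: float) -> float:
    """Chance that n independent samples all pass despite a real failure rate."""
    return (1.0 - failure_rate) ** n_sampled


def thin_slices(slice_sizes: Dict[str, int], rate: float, min_samples: float) -> List[str]:
    """Slices whose expected sample count falls below a chosen floor."""
    expected = expected_samples(slice_sizes, rate)
    return [name for name, n in expected.items() if n < min_samples]


# Hypothetical slice sizes for illustration.
SLICES = {"general": 1000, "medical": 40}

print(expected_samples(SLICES, 0.25))
print(thin_slices(SLICES, 0.25, min_samples=30))
print(round(prob_zero_failures_seen(10, 0.1), 4))
```

Under these assumptions, a uniform 25% sample leaves the small slice with about 10 examples, and a slice with a 10% true failure rate would show zero failures in roughly a third of such samples. A uniform rate that looks safe overall can therefore leave a small, high-risk slice effectively unmonitored.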
Chapter takeaway
Eval cost is not waste if it prevents bad decisions. It is waste when nobody knows what decision the spend supports.