Cost, Latency, and Sampling

The setup

Evaluation has a budget. Human review costs time, model judges cost money, and full-suite runs cost latency. The answer is not "run less." The answer is to spend evaluation effort where it changes decisions.

Picture this

Good, bad, and ugly paths for evaluation budget management.

Mental model

Tier the suite: deterministic schema checks, cheap lexical or rule checks, sampled judge checks, human review for high-risk or disagreement cases, and full batch runs before major releases.
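The tiering above can be sketched as a cheapest-first pipeline that stops at the first failure, so expensive tiers only see examples that survived the cheap ones. This is a minimal sketch, not a reference implementation: the `Tier` class, the check functions, the example format (a dict with an `output` string), and the per-example costs are all assumptions made for illustration. Human review and full batch runs are left out to keep it short.

```python
import json
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tier:
    name: str
    cost_usd: float                 # illustrative cost per example, not a real price
    check: Callable[[dict], bool]   # returns True when the example passes
    sample_rate: float = 1.0        # fraction of examples that reach this tier


def schema_check(example: dict) -> bool:
    """Deterministic tier: output must be a JSON object with an 'answer' key."""
    try:
        parsed = json.loads(example["output"])
    except (KeyError, TypeError, ValueError):
        return False
    return isinstance(parsed, dict) and "answer" in parsed


def rule_check(example: dict) -> bool:
    """Cheap lexical tier: reject outputs containing a banned phrase."""
    return "as an ai" not in example["output"].lower()


def stub_judge(example: dict) -> bool:
    """Stand-in for a model judge call; always passes in this sketch."""
    return True


def run_tiers(
    example: dict, tiers: List[Tier], rng: random.Random
) -> Tuple[Optional[str], float]:
    """Run tiers in order; return (name of failing tier or None, money spent)."""
    spent = 0.0
    for tier in tiers:
        if rng.random() >= tier.sample_rate:
            continue  # this example was not sampled into this tier
        spent += tier.cost_usd
        if not tier.check(example):
            return tier.name, spent  # stop early: later tiers cost more
    return None, spent


TIERS = [
    Tier("schema", 0.0, schema_check),
    Tier("rules", 0.0, rule_check),
    Tier("judge", 0.02, stub_judge, sample_rate=0.2),  # sampled judge tier
]
```

Ordering the tiers by cost means a malformed output never reaches the judge, which is where most of the spend sits in this sketch.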

Good

The good version samples by risk slice, runs heavyweight judges on uncertain or high-impact examples, and tracks cost per decision. It can explain why a test runs nightly instead of per commit.
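One way to read the good version in code: always send uncertain examples to the heavyweight judge, sample the rest at a rate set per risk slice, and divide spend by the number of decisions the eval actually changed. The sketch below assumes each example carries a `slice` label and a `cheap_score` between 0 and 1 from an earlier cheap check. The slice names, rates, uncertainty band, and this particular definition of cost per decision are hypothetical choices, not something the chapter prescribes.

```python
import random
from collections import defaultdict

# Hypothetical judge sampling rates per risk slice (higher risk, higher rate).
SLICE_RATES = {"payments": 1.0, "account_changes": 0.5, "small_talk": 0.05}


def is_uncertain(example, low=0.4, high=0.6):
    """Assumed rule: a cheap score near the middle means the cheap check is unsure."""
    return low <= example["cheap_score"] <= high


def select_for_judge(examples, rates, default_rate=0.1, seed=0):
    """Pick examples for the expensive judge: all uncertain ones, plus a
    per-slice random sample of the rest."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    selected = [ex for ex in examples if is_uncertain(ex)]

    by_slice = defaultdict(list)
    for ex in examples:
        if not is_uncertain(ex):
            by_slice[ex["slice"]].append(ex)

    for name, items in by_slice.items():
        rate = rates.get(name, default_rate)
        # Keep at least one example per slice so no slice goes unobserved.
        k = min(len(items), max(1, round(rate * len(items))))
        selected.extend(rng.sample(items, k))
    return selected


def cost_per_decision(total_cost_usd, decisions_changed):
    """Spend divided by decisions the eval changed (for example, a blocked
    release). Returns infinity when the spend changed nothing."""
    if decisions_changed == 0:
        return float("inf")
    return total_cost_usd / decisions_changed
```

A check whose cost per decision stays at infinity for months is a candidate for a lower sampling rate or a less frequent trigger, provided the risk it covers is reviewed at the same time.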

Bad

The bad version either runs everything always or nothing until launch week. Both are expensive: one in compute, the other in incidents. Choose your invoice.

Ugly

The ugly reality is budget pressure. Teams may remove the most expensive evals first, even if they guard the highest-risk failures. Keep risk and cost visible together.

Artifact to produce

Create an eval budget table: check, cost, runtime, trigger, risk covered, sampling rate, and owner.
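The table can live next to the eval code so it is reviewed with it. The sketch below uses exactly the seven columns listed above. Every row value (check names, costs, runtimes, owners) is invented for illustration, and the trigger labels `per_pr`, `nightly`, and `pre_release` are assumed names.

```python
from dataclasses import astuple, dataclass, fields


@dataclass
class BudgetRow:
    check: str
    cost_usd_per_run: float
    runtime_s: float
    trigger: str          # assumed labels: "per_pr", "nightly", "pre_release"
    risk_covered: str
    sampling_rate: float  # fraction of examples the check sees
    owner: str


# Illustrative rows only; replace with measured numbers.
ROWS = [
    BudgetRow("schema", 0.0, 2.0, "per_pr", "malformed output", 1.0, "platform"),
    BudgetRow("lexical_rules", 0.0, 15.0, "per_pr", "banned phrases", 1.0, "platform"),
    BudgetRow("llm_judge", 4.5, 600.0, "nightly", "wrong answers", 0.2, "eval-team"),
    BudgetRow("human_review", 120.0, 86400.0, "pre_release", "policy harm", 0.02, "trust"),
]


def runtime_for_trigger(rows, trigger):
    """Total wall-clock seconds added by every check on a given trigger."""
    return sum(row.runtime_s for row in rows if row.trigger == trigger)


def format_table(rows):
    """Render the rows as tab-separated text with a header line."""
    header = [f.name for f in fields(BudgetRow)]
    lines = ["\t".join(header)]
    for row in rows:
        lines.append("\t".join(str(value) for value in astuple(row)))
    return "\n".join(lines)
```

Calling `runtime_for_trigger(ROWS, "per_pr")` gives the time every pull request pays, which is the number to watch when someone proposes moving a check from nightly to per commit.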

Budget review

Question	Why it matters
Which checks are cheap enough for every PR?	Fast checks protect developer flow.
Where is expensive judging concentrated?	Spend where uncertainty matters.
What risk would be exposed by reducing sample size?	Savings can hide failures in the weakest slices.
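The last question can be given a rough number. The sketch below uses the standard normal approximation for a binomial proportion to estimate how wide the uncertainty on a slice's failure rate becomes as the sample shrinks. The function name and the 95% default are assumptions, and the approximation is crude for very small samples or failure rates near 0 or 1.

```python
import math


def margin_of_error(failure_rate, n, z=1.96):
    """Approximate half-width of a confidence interval on a failure rate
    estimated from n sampled examples (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return z * math.sqrt(failure_rate * (1.0 - failure_rate) / n)


def margins_by_sample_size(failure_rate, sizes):
    """Map each candidate sample size to its approximate margin of error."""
    return {n: margin_of_error(failure_rate, n) for n in sizes}
```

At an assumed 10% failure rate, 400 samples give a margin of about 2.9 percentage points, 100 samples about 5.9, and 25 samples about 11.8. Cutting a slice's sample to a quarter doubles the margin, so a small slice can lose the ability to show a regression at all.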

Chapter takeaway

Eval cost is not waste if it prevents bad decisions. It is waste when nobody knows what decision the spend supports.

References

OpenAI cost optimization guide
