What Is An Eval?

The setup

An eval is a structured argument about whether a system is ready for a specific decision. The unit is not "model quality" in the abstract. The unit is a product claim: this assistant can answer billing questions with citations, this classifier can abstain on ambiguous tickets, or this summarizer can preserve the contractual clause that matters.

Picture this

Good, bad, and ugly paths for decision-backed evaluation.

Mental model

Think of an eval as a contract with five parts: task, examples, scoring rule, threshold, and action. If any part is missing, people will fill the gap with vibes, seniority, or the last demo they remember. Great, now your launch process is a group chat.
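One way to make the five-part contract concrete is to write it down as a record and check which parts are still blank. The sketch below is a minimal illustration, not a standard format: the `EvalCard` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvalCard:
    """Hypothetical record of the five contract parts (names are illustrative)."""
    task: Optional[str] = None
    examples: Optional[str] = None       # e.g. a dataset name plus version
    scoring_rule: Optional[str] = None
    threshold: Optional[str] = None
    action: Optional[str] = None


def missing_parts(card: EvalCard) -> list[str]:
    """Return the names of contract parts that are absent or blank."""
    return [
        f.name
        for f in fields(card)
        if not (getattr(card, f.name) or "").strip()
    ]


# Made-up example: three parts filled in, two left to the group chat.
card = EvalCard(
    task="Answer billing questions with citations",
    examples="billing-questions dataset, v3",
    scoring_rule="citation faithfulness rubric, v2",
)
print(missing_parts(card))  # ['threshold', 'action']
```

A check like this does not make the eval good, but it does make the gaps visible before anyone argues about the result.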

Good

The good version names the product decision first. It says: "We will expand to 10 percent of traffic if billing-answer citation faithfulness is at least 95 percent on high-risk slices and no critical policy failures appear." The eval has a dataset version, rubric version, scorer version, and a documented owner.
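The decision rule quoted above can be sketched as a small gate function. This is an illustration under stated assumptions: the `SliceResult` type, the slice names, and the counts are invented, and "on high-risk slices" is read here as "every high-risk slice must clear the bar on its own" rather than a pooled average. A team could reasonably choose the pooled reading instead, as long as it is written down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    name: str
    high_risk: bool
    faithful: int   # answers whose citations were judged faithful
    total: int      # answers scored on this slice


def should_expand(slices, critical_policy_failures, min_faithfulness=0.95):
    """Return (decision, reasons); expand only when no blocking reason exists."""
    reasons = []
    if critical_policy_failures > 0:
        reasons.append(f"{critical_policy_failures} critical policy failure(s)")
    for s in slices:
        if not s.high_risk:
            continue  # assumption: only high-risk slices gate the rollout
        if s.total == 0:
            reasons.append(f"{s.name}: no examples scored")
            continue
        rate = s.faithful / s.total
        if rate < min_faithfulness:
            reasons.append(
                f"{s.name}: faithfulness {rate:.1%} below {min_faithfulness:.0%}"
            )
    return (not reasons, reasons)


# Made-up numbers for illustration only.
slices = [
    SliceResult("refunds", high_risk=True, faithful=192, total=200),
    SliceResult("chargebacks", high_risk=True, faithful=93, total=100),
    SliceResult("general", high_risk=False, faithful=80, total=100),
]
decision, reasons = should_expand(slices, critical_policy_failures=0)
print(decision)  # False
print(reasons)   # ['chargebacks: faithfulness 93.0% below 95%']
```

Returning the reasons alongside the decision keeps the action tied to specific evidence, which is the point of naming the decision first.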

Bad

The bad version runs twenty prompts, picks the best transcript, and calls the model improved. It usually has no fixed threshold and no failure taxonomy. If the answer sounds fluent, everyone relaxes, which is adorable until production users ask different questions.
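To see why the best transcript is a poor summary, compare it with the pass rate and failure counts over the same twenty prompts. The scores, failure labels, and pass mark below are made up for illustration; the only claim is arithmetic.

```python
from collections import Counter

# Made-up results for twenty prompts: (rubric score in [0, 1], failure label or None).
results = (
    [(0.98, None)]
    + [(0.90, None)] * 12
    + [(0.60, "missing_citation")] * 4
    + [(0.40, "wrong_amount")] * 3
)

PASS_MARK = 0.90  # hypothetical threshold, fixed before the run

# What the bad version reports: the single best transcript.
best_transcript_score = max(score for score, _ in results)

# What a fixed threshold and a failure taxonomy would report instead.
pass_rate = sum(score >= PASS_MARK for score, _ in results) / len(results)
failure_counts = Counter(label for _, label in results if label is not None)

print(best_transcript_score)         # 0.98
print(pass_rate)                     # 0.65
print(failure_counts.most_common())  # [('missing_citation', 4), ('wrong_amount', 3)]
```

The same run supports "0.98" and "65 percent pass, mostly missing citations". Only the second tells anyone what to fix.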

Question	Why it matters
Does the eval name the decision it informs?	Without this, the result is trivia.
Are task, examples, scoring rule, threshold, and action all explicit?	Missing pieces become hallway debate.
Can another reviewer rerun the same procedure?	Repeatability is the difference between evidence and storytelling.
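A lightweight aid for the rerun question is to record the versions that define a run and derive a short fingerprint from them, so two reviewers can confirm they ran the same procedure. The sketch below is one possible approach, not an established convention: the manifest keys and version strings are hypothetical.

```python
import hashlib
import json


def run_fingerprint(manifest: dict) -> str:
    """Hash a canonical JSON form of the manifest; key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical manifest; the keys mirror the versions an eval should pin.
manifest = {
    "dataset_version": "billing-v3",
    "rubric_version": "faithfulness-v2",
    "scorer_version": "scorer-1.4.0",
    "owner": "billing-evals-team",
}

same_run_reordered = dict(reversed(list(manifest.items())))
changed_rubric = {**manifest, "rubric_version": "faithfulness-v3"}

print(run_fingerprint(manifest) == run_fingerprint(same_run_reordered))  # True
print(run_fingerprint(manifest) == run_fingerprint(changed_rubric))      # False
```

If two reported scores carry different fingerprints, the first question is which input changed, not which model is better.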

Chapter takeaway

A first eval should make the next decision less ambiguous. If it cannot do that, it is probably a score-shaped decoration.

