The setup
An eval is a structured argument about whether a system is ready for a specific decision. The unit is not "model quality" in the abstract. The unit is a product claim: this assistant can answer billing questions with citations, this classifier can abstain on ambiguous tickets, or this summarizer can preserve the contractual clause that matters.
Mental model
Think of an eval as a contract with five parts: task, examples, scoring rule, threshold, and action. If any part is missing, people will fill the gap with vibes, seniority, or the last demo they remember. Great, now your launch process is a group chat.
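One way to make the five-part contract concrete is to write it down as a small record and check which parts are still blank. The sketch below is illustrative only: `EvalContract`, `missing_parts`, and the example values are hypothetical names chosen for this example, not part of any library.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvalContract:
    """The five parts of the contract; None means nobody has written it down."""
    task: Optional[str] = None
    examples: Optional[str] = None
    scoring_rule: Optional[str] = None
    threshold: Optional[str] = None
    action: Optional[str] = None


def missing_parts(contract: EvalContract) -> list[str]:
    """Return the names of parts that are absent or blank."""
    missing = []
    for f in fields(contract):
        value = getattr(contract, f.name)
        if value is None or not value.strip():
            missing.append(f.name)
    return missing


# A draft that names the task, examples, and scoring rule, but stops there.
draft = EvalContract(
    task="Answer billing questions with citations",
    examples="billing-questions dataset (hypothetical name)",
    scoring_rule="citation faithfulness rubric",
)
print(missing_parts(draft))  # ['threshold', 'action']
```

Anything this check returns is a part that people will otherwise settle by argument.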
Good
The good version names the product decision first. It says: "We will expand to 10 percent of traffic if billing-answer citation faithfulness is at least 95 percent on high-risk slices and no critical policy failures appear." The eval has a dataset version, rubric version, scorer version, and a documented owner.
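The quoted decision rule can be sketched as a small gate function. This is a minimal illustration under assumptions: `SliceResult`, `rollout_decision`, and the slice names are hypothetical, and the extra checks for empty or absent high-risk slices are additions not stated in the rule itself.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    name: str
    high_risk: bool
    faithful: int  # examples whose citations were judged faithful
    total: int     # examples scored in this slice


def rollout_decision(slices, critical_policy_failures, threshold=0.95):
    """Return (decision, reasons); any reason at all means hold."""
    reasons = []
    if critical_policy_failures > 0:
        reasons.append(f"{critical_policy_failures} critical policy failure(s)")
    high_risk = [s for s in slices if s.high_risk]
    if not high_risk:
        # Assumption: no evidence on high-risk slices is treated as a failure.
        reasons.append("no high-risk slices scored")
    for s in high_risk:
        if s.total == 0:
            reasons.append(f"{s.name}: no examples scored")
        elif s.faithful / s.total < threshold:
            reasons.append(f"{s.name}: {s.faithful}/{s.total} below {threshold:.0%}")
    return ("hold" if reasons else "expand_to_10_percent", reasons)


# Hypothetical slice names and counts, for illustration only.
slices = [
    SliceResult("refund disputes", True, 97, 100),
    SliceResult("plan changes", True, 93, 100),
    SliceResult("greetings", False, 40, 50),
]
decision, reasons = rollout_decision(slices, critical_policy_failures=0)
print(decision, reasons)  # hold ['plan changes: 93/100 below 95%']
```

The gate checks each high-risk slice separately rather than averaging them, so one weak slice cannot hide behind a strong one.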
Bad
The bad version runs twenty prompts, picks the best transcript, and calls the model improved. It usually has no fixed threshold and no failure taxonomy. If the answer sounds fluent, everyone relaxes, which is adorable until production users ask different questions.
Ugly
The ugly reality is that stakeholders often want one number. You may need one headline number, but it must point to slice detail, failure examples, and an action. The headline is the doorway, not the building.
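A report shape that keeps the headline attached to its supporting detail might look like the sketch below. The function `build_report`, the record keys, and the example IDs are assumptions made for illustration; a real scorer would emit its own format.

```python
def build_report(results, action):
    """Summarize scored examples into one headline plus per-slice detail.

    results: list of dicts with 'example_id', 'slice', and 'passed' (bool).
    action: the decision this report is meant to support.
    """
    by_slice = {}
    for r in results:
        entry = by_slice.setdefault(
            r["slice"], {"passed": 0, "total": 0, "failures": []}
        )
        entry["total"] += 1
        if r["passed"]:
            entry["passed"] += 1
        else:
            # Keep the failing example IDs so the number can be audited.
            entry["failures"].append(r["example_id"])
    total = sum(e["total"] for e in by_slice.values())
    passed = sum(e["passed"] for e in by_slice.values())
    return {
        "headline": passed / total if total else None,
        "slices": by_slice,
        "action": action,
    }


# Hypothetical scored examples.
results = [
    {"example_id": "ex-1", "slice": "refunds", "passed": True},
    {"example_id": "ex-2", "slice": "refunds", "passed": True},
    {"example_id": "ex-3", "slice": "refunds", "passed": False},
    {"example_id": "ex-4", "slice": "invoices", "passed": True},
]
report = build_report(results, action="hold rollout and review refunds failures")
print(report["headline"])                       # 0.75
print(report["slices"]["refunds"]["failures"])  # ['ex-3']
```

The stakeholder gets the single number, and anyone who doubts it can walk from the headline to the slice to the failing example.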
Artifact to produce
Write an eval card with: decision, task boundary, dataset source, slices, scoring rule, threshold, refresh cadence, and rollback owner.
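As a rough sketch, the card can live as a typed record that renders to plain text for review. `EvalCard` and `render_card` are hypothetical names; the decision and threshold reuse the billing example above, while the dataset source, cadence, and owner values are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass
class EvalCard:
    decision: str
    task_boundary: str
    dataset_source: str
    slices: list[str]
    scoring_rule: str
    threshold: str
    refresh_cadence: str
    rollback_owner: str


def render_card(card: EvalCard) -> str:
    """Render one 'Label: value' line per field, in declaration order."""
    lines = []
    for f in fields(card):
        value = getattr(card, f.name)
        if isinstance(value, list):
            value = ", ".join(value)
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


card = EvalCard(
    decision="Expand to 10 percent of traffic",
    task_boundary="Billing questions answered with citations",
    dataset_source="Sampled support tickets (placeholder)",
    slices=["high-risk billing", "routine billing"],
    scoring_rule="Citation faithfulness per rubric",
    threshold="At least 95 percent on high-risk slices, no critical policy failures",
    refresh_cadence="Monthly (placeholder)",
    rollback_owner="On-call lead (placeholder)",
)
print(render_card(card))
```

Because every field is required, the constructor raises a `TypeError` if someone tries to create a card without, say, a rollback owner.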
Eval card review
| Question | Why it matters |
|---|---|
| Does the eval name the decision it informs? | Without this, the result is trivia. |
| Are task, examples, scoring rule, threshold, and action all explicit? | Missing pieces become hallway debate. |
| Can another reviewer rerun the same procedure? | Repeatability is the difference between evidence and storytelling. |
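One lightweight aid to the rerun question is to stamp every result with a fingerprint of the versions that produced it. The sketch below assumes the three version strings are all that defines a run, which is a simplification; `run_fingerprint` and the version labels are hypothetical.

```python
import hashlib
import json


def run_fingerprint(dataset_version, rubric_version, scorer_version):
    """Return a short, stable ID for a (dataset, rubric, scorer) combination."""
    payload = json.dumps(
        {
            "dataset": dataset_version,
            "rubric": rubric_version,
            "scorer": scorer_version,
        },
        sort_keys=True,  # fixed key order keeps the hash stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical version labels.
first = run_fingerprint("billing-v3", "rubric-v2", "scorer-v5")
rerun = run_fingerprint("billing-v3", "rubric-v2", "scorer-v5")
changed = run_fingerprint("billing-v4", "rubric-v2", "scorer-v5")
print(first == rerun, first == changed)  # True False
```

Two results with different fingerprints were not produced by the same procedure, so they should not be compared as if they were.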
Chapter takeaway
A first eval should make the next decision less ambiguous. If it cannot do that, it is probably a score-shaped decoration.