The setup
The dataset is the terrain your eval can see. If the terrain is all sunny happy-path examples, your score is a vacation brochure. Good eval datasets include representative traffic, high-risk slices, edge cases, and intentionally hard examples.
Picture this
Mental model
Use multiple sets: a development set for iteration, a golden set for stable regression, a stress set for known hazards, and a sampled production-review set for freshness. Mixing them is possible, but only if you enjoy making every future score suspicious.
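One way to keep the sets from quietly mixing is to check that no example ID lives in more than one of them. The sketch below is a minimal illustration, assuming each example has a stable string ID; the function name `find_overlaps`, the set names, and the IDs are all hypothetical.

```python
from itertools import combinations


def find_overlaps(sets_by_name):
    """Return {(set_a, set_b): shared_ids} for every pair of sets that share IDs."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(sets_by_name.items(), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical example IDs, one set per purpose described above.
eval_sets = {
    "development": {"ex-001", "ex-002", "ex-003"},
    "golden": {"ex-101", "ex-102"},
    "stress": {"ex-201", "ex-002"},  # ex-002 also sits in the development set
    "production_review": {"ex-301"},
}

overlaps = find_overlaps(eval_sets)
for pair, shared in overlaps.items():
    print(f"{pair[0]} and {pair[1]} share: {sorted(shared)}")
```

Running a check like this before each scoring run is cheap, and it turns "are these sets still separate?" from a belief into a fact.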
Good
The good version records source, sampling method, timestamp, inclusion rules, exclusion rules, label owner, and version. It keeps final holdout examples away from prompt tuning and uses slices to ensure rare but important cases are present.
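The slice requirement can be made checkable. The following is a rough sketch, assuming every example carries a `slice` field and that the team has agreed on a minimum count per slice; the slice names and minimums are invented for illustration.

```python
from collections import Counter


def under_covered_slices(examples, minimums):
    """Map each slice that falls short of its minimum to its actual count."""
    counts = Counter(ex["slice"] for ex in examples)
    return {
        name: counts[name]  # Counter returns 0 for slices with no examples
        for name, required in minimums.items()
        if counts[name] < required
    }


# Hypothetical examples and per-slice minimums.
examples = [
    {"id": "ex-001", "slice": "typical"},
    {"id": "ex-002", "slice": "typical"},
    {"id": "ex-003", "slice": "typical"},
    {"id": "ex-004", "slice": "refund_dispute"},
]
minimums = {"typical": 3, "refund_dispute": 2, "non_english": 1}

shortfalls = under_covered_slices(examples, minimums)
print(shortfalls)
```

A slice with zero examples is reported just like one that is merely thin, which is the point: rare but important cases should fail loudly when they are missing.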
Bad
The bad version copies examples from demos, docs, and last week's bug bash into one file called final_eval.json. It then tunes prompts until that file is green. Congratulations, you invented overfitting, but with JSON.
Ugly
The ugly reality is that real labels are incomplete, production traffic shifts, and sensitive examples may need redaction. Dataset design must therefore include a refresh plan and a privacy review, not just a row count.
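Refresh and privacy review are easier to enforce when they are gates rather than intentions. Here is a minimal sketch, assuming the dataset records a last-refresh date and a privacy-review flag; the 90-day limit and the function name `release_blockers` are placeholders, not recommendations.

```python
from datetime import date, timedelta


def release_blockers(last_refreshed, privacy_reviewed, today, max_age_days=90):
    """List the reasons a dataset version should not be used yet."""
    blockers = []
    if today - last_refreshed > timedelta(days=max_age_days):
        blockers.append("refresh overdue")
    if not privacy_reviewed:
        blockers.append("privacy review missing")
    return blockers


# Hypothetical dataset states, evaluated on the same day.
today = date(2024, 6, 1)
stale = release_blockers(date(2024, 1, 1), privacy_reviewed=False, today=today)
fresh = release_blockers(date(2024, 5, 1), privacy_reviewed=True, today=today)
print(stale)
print(fresh)
```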
Artifact to produce
Maintain a dataset manifest with version, source, sampling frame, slices, label policy, known exclusions, and allowed use.
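A manifest can be as small as one typed record serialized next to the data. The sketch below mirrors the fields listed above; the class name `DatasetManifest`, the field types, and every value are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetManifest:
    version: str
    source: str
    sampling_frame: str
    slices: list[str]
    label_policy: str
    known_exclusions: list[str]
    allowed_use: list[str]


# Hypothetical values; replace them with whatever your dataset actually is.
manifest = DatasetManifest(
    version="golden-v3",
    source="sampled production traffic",
    sampling_frame="one week of support tickets, uniform random sample",
    slices=["typical", "refund_dispute", "non_english"],
    label_policy="two reviewers, disagreements escalated to the label owner",
    known_exclusions=["tickets containing payment card numbers"],
    allowed_use=["regression"],  # notably absent: prompt tuning
)

manifest_json = json.dumps(asdict(manifest), indent=2)
print(manifest_json)
```

Keeping `allowed_use` in the manifest means the holdout rule travels with the file instead of living in someone's memory.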
Dataset review
| Question | Why it matters |
|---|---|
| Where did each example come from? | Source determines bias and leakage risk. |
| Which examples are reserved from prompt tuning? | Holdout contamination ruins comparability. |
| How will the dataset refresh? | A stale golden set becomes a nostalgia artifact. |
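
The first two review questions can be answered from per-example metadata. This is a sketch under stated assumptions: each record has `id`, `source`, and `reserved` fields, and the team logs which IDs were used during prompt tuning. The field names and data are hypothetical.

```python
from collections import Counter


def review_dataset(examples, tuning_ids):
    """Summarize provenance and flag reserved examples that were used for tuning."""
    source_counts = Counter(ex.get("source") or "UNKNOWN" for ex in examples)
    leaked = sorted(
        ex["id"] for ex in examples if ex.get("reserved") and ex["id"] in tuning_ids
    )
    return {"source_counts": dict(source_counts), "holdout_in_tuning": leaked}


# Hypothetical records and tuning log.
examples = [
    {"id": "ex-001", "source": "production_sample", "reserved": False},
    {"id": "ex-002", "source": "bug_bash", "reserved": True},
    {"id": "ex-003", "source": None, "reserved": True},  # provenance was never recorded
]
tuning_ids = {"ex-001", "ex-003"}

report = review_dataset(examples, tuning_ids)
print(report)
```

An `UNKNOWN` source count above zero, or a non-empty `holdout_in_tuning` list, is a reason to stop and investigate before trusting the next score.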
Chapter takeaway
Datasets are not just rows. They are assumptions with IDs. Label them before they start freelancing.