The setup
CI evals are where product intent meets developer workflow. They can prevent regressions before rollout, but only if they are fast enough, deterministic enough, and important enough that people respect the red light.
Picture this
Mental model
Split checks into tiers: unit-level contracts, cheap golden examples, expensive judge runs, and offline batch suites. Not every eval belongs in every pull request. Blocking checks should be high-signal and low-flake; advisory checks can be slower and broader.
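One way to make the tiers concrete is to tag every check with its tier and let only the trusted tiers decide the merge. The sketch below is illustrative rather than prescriptive: the names `Tier`, `CheckResult`, and `gate`, and the choice of which tiers block or run on a pull request, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CONTRACT = "unit-level contract"
    GOLDEN = "cheap golden example"
    JUDGE = "expensive judge run"
    BATCH = "offline batch suite"


# Assumption: only the two cheapest, most deterministic tiers may block a merge.
BLOCKING_TIERS = {Tier.CONTRACT, Tier.GOLDEN}
# Assumption: offline batch suites never run on a pull request.
PR_TIERS = {Tier.CONTRACT, Tier.GOLDEN, Tier.JUDGE}


@dataclass(frozen=True)
class CheckResult:
    name: str
    tier: Tier
    passed: bool


def runs_on_pull_request(tier: Tier) -> bool:
    """Decide whether a check of this tier is scheduled on every pull request."""
    return tier in PR_TIERS


def gate(results: list[CheckResult]) -> tuple[bool, list[str]]:
    """Return (merge_allowed, advisory_failures) for one pull request."""
    failed = [r for r in results if not r.passed]
    blocking = [r.name for r in failed if r.tier in BLOCKING_TIERS]
    advisory = [r.name for r in failed if r.tier not in BLOCKING_TIERS]
    return (not blocking, advisory)
```

With this shape, a failing judge run shows up as a warning on the pull request, while a failing golden example turns the gate red.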
Good
The good version pins model versions and stubs tool calls where possible, uses fixed fixtures, keeps critical checks small, stores failure artifacts, and has an owner for flaky tests. It blocks critical regressions and reports broader quality trends separately.
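A minimal sketch of that pattern, assuming a stubbed model and a handful of fixed fixtures. The names `stub_model` and `run_golden`, the fixture fields, and the one-JSON-file-per-failure artifact layout are all hypothetical choices for illustration.

```python
import json
import tempfile
from pathlib import Path


def stub_model(prompt: str) -> str:
    """Stand-in for a pinned model: same prompt, same answer, no network."""
    canned = {"What is the refund window?": "30 days"}
    return canned.get(prompt, "")


def run_golden(fixtures: list[dict], model, artifact_dir: Path) -> list[str]:
    """Run fixed fixtures; write one JSON artifact per failure, return failed ids."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for case in fixtures:
        actual = model(case["prompt"])
        if actual != case["expected"]:
            failed.append(case["id"])
            # Keep the input, the expectation, and the actual output together
            # so a developer can inspect the failure without rerunning CI.
            record = {**case, "actual": actual}
            path = artifact_dir / f"{case['id']}.json"
            path.write_text(json.dumps(record, indent=2))
    return failed


if __name__ == "__main__":
    fixtures = [
        {"id": "refund", "prompt": "What is the refund window?", "expected": "30 days"},
    ]
    out_dir = Path(tempfile.mkdtemp())
    print(run_golden(fixtures, stub_model, out_dir))
```

Because the stub is deterministic, a red result here points at a code or prompt change rather than at sampling noise.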
Bad
The bad version runs a giant nondeterministic eval on every commit. It fails randomly, developers rerun until green, and soon the team learns the sacred ritual: ignore CI unless it fails twice. Very mature.
Ugly
The ugly reality is that some LLM behavior is inherently variable. Use tolerance bands, sample repeatedly only where it is useful, and send ambiguous deltas to human review. Do not pretend a flaky judge is a compiler.
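A tolerance band can be sketched as a three-way verdict on the mean of a few repeated judge scores. The thresholds below (`tolerance=0.03`, `hard_drop=0.10`) and the names `Verdict` and `judge_verdict` are placeholder assumptions, not recommended values; a real band would be tuned to the observed noise of the judge.

```python
from enum import Enum
from statistics import mean


class Verdict(Enum):
    PASS = "pass"
    REVIEW = "needs human review"
    FAIL = "fail"


def judge_verdict(
    scores: list[float],
    baseline: float,
    tolerance: float = 0.03,  # assumed noise band around the baseline
    hard_drop: float = 0.10,  # assumed drop that counts as a clear regression
) -> Verdict:
    """Compare the mean of repeated judge scores against a stored baseline."""
    if not scores:
        raise ValueError("need at least one sample")
    delta = mean(scores) - baseline
    if delta >= -tolerance:
        return Verdict.PASS  # within normal variation, or an improvement
    if delta <= -hard_drop:
        return Verdict.FAIL  # too large to blame on sampling noise
    return Verdict.REVIEW  # ambiguous delta: hand it to a person
```

The middle verdict is the point of the design: an ambiguous delta is routed to a reviewer instead of being forced into red or green.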
Artifact to produce
Define a CI policy that names the blocking checks, the advisory checks, the maximum runtime, the retry policy, the artifacts to capture, and the flake owner.
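The policy can live as a small, reviewable object rather than as tribal knowledge. The following is a sketch under assumptions: the class `CIEvalPolicy`, its field names, and the specific rules in `problems` (for example, capping retries on blocking checks at one) are invented for illustration and should be adapted to the team's own conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CIEvalPolicy:
    blocking_checks: tuple[str, ...]
    advisory_checks: tuple[str, ...]
    max_runtime_minutes: int
    max_retries: int
    artifact_dir: str
    flake_owner: str

    def problems(self) -> list[str]:
        """Return human-readable gaps in the policy; an empty list means complete."""
        found = []
        if not self.blocking_checks:
            found.append("no blocking checks defined")
        if set(self.blocking_checks) & set(self.advisory_checks):
            found.append("a check is listed as both blocking and advisory")
        if self.max_runtime_minutes <= 0:
            found.append("maximum runtime must be positive")
        # Assumption: more than one automatic retry hides flakes instead of fixing them.
        if self.max_retries > 1:
            found.append("retry limit above 1 encourages rerun-until-green")
        if not self.artifact_dir:
            found.append("no artifact location set")
        if not self.flake_owner:
            found.append("no owner for flaky evals")
        return found


policy = CIEvalPolicy(
    blocking_checks=("schema_contract", "refund_golden"),
    advisory_checks=("tone_judge",),
    max_runtime_minutes=10,
    max_retries=1,
    artifact_dir="eval-artifacts/",
    flake_owner="evals-oncall",
)
print(policy.problems())
```

Checking the policy object in CI itself keeps the answers to the review questions from drifting out of date.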
CI review
| Question | Why it matters |
|---|---|
| Which checks block merges? | Blocking checks need high trust. |
| What failure artifact does CI preserve? | Developers need examples, not vibes in red text. |
| Who owns flaky evals? | Unowned flakes become ignored gates. |
Chapter takeaway
A flaky eval in CI teaches developers one lesson: rerun until green. Stunning pedagogy, terrible guardrail.