Regression Suites And CI

The setup

CI evals are where product intent meets developer workflow. They can prevent regressions before rollout, but only if they are fast enough, deterministic enough, and important enough that people respect the red light.

Picture this

Good, bad, and ugly paths for eval-in-CI.

Mental model

Split checks into tiers: unit-level contracts, cheap golden examples, expensive judge runs, and offline batch suites. Not every eval belongs in every pull request. Blocking checks should be high-signal and low-flake; advisory checks can be slower and broader.
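One way to make the tiers concrete is to tag every check with a tier and a blocking flag, then let CI pick what runs on a pull request. The sketch below is illustrative only: the `Tier` names mirror the list above, while `Check`, `PR_TIERS`, and the example check names are hypothetical and not taken from any particular CI system.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CONTRACT = "unit-level contract"
    GOLDEN = "cheap golden example"
    JUDGE = "expensive judge run"
    BATCH = "offline batch suite"


@dataclass(frozen=True)
class Check:
    name: str
    tier: Tier
    blocking: bool  # True: a failure blocks the merge; False: advisory only


# Assumed policy for this sketch: only the two cheap tiers run on every pull request.
PR_TIERS = {Tier.CONTRACT, Tier.GOLDEN}


def checks_for_pull_request(checks):
    """Return the checks that should run on a pull request."""
    return [c for c in checks if c.tier in PR_TIERS]


def merge_allowed(results):
    """results maps Check -> passed (bool). Only blocking failures stop a merge."""
    return all(passed for check, passed in results.items() if check.blocking)


# Hypothetical checks, one per tier.
CHECKS = [
    Check("output_is_valid_json", Tier.CONTRACT, blocking=True),
    Check("refund_policy_golden", Tier.GOLDEN, blocking=True),
    Check("tone_judge", Tier.JUDGE, blocking=False),
    Check("weekly_full_suite", Tier.BATCH, blocking=False),
]
```

With this shape, a failing advisory judge run still shows up in the results but never turns the merge gate red.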

Good

The good version pins or stubs the model and tools where possible, uses fixed fixtures, keeps critical checks small, stores failure artifacts, and assigns an owner for flaky tests. It blocks on critical regressions and reports broader quality trends separately.
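A minimal sketch of that shape, assuming a canned-response stub stands in for the live model and that "failure artifact" means one JSON file per failing case. The fixture contents, `stub_model`, and `run_golden_suite` are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Fixed fixture: in a real repo this would be a checked-in file (inlined here).
GOLDEN_CASES = [
    {"id": "refund-01", "prompt": "Can I get a refund?", "must_contain": "30 days"},
    {"id": "hours-01", "prompt": "When are you open?", "must_contain": "9am"},
]

# Pinned stub: canned responses replace the live model, so the check is deterministic.
STUB_RESPONSES = {
    "Can I get a refund?": "Yes, within 30 days of purchase.",
    "When are you open?": "We are open from 9am to 5pm.",
}


def stub_model(prompt):
    return STUB_RESPONSES[prompt]


def run_golden_suite(model, cases, artifact_dir):
    """Run each case; write one JSON artifact per failure. Return the failed ids."""
    artifact_dir = Path(artifact_dir)
    artifact_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for case in cases:
        output = model(case["prompt"])
        if case["must_contain"] not in output:
            failed.append(case["id"])
            # Keep the input, the expectation, and the actual output together.
            record = {**case, "actual_output": output}
            path = artifact_dir / f"{case['id']}.json"
            path.write_text(json.dumps(record, indent=2))
    return failed


with tempfile.TemporaryDirectory() as tmp:
    failed_ids = run_golden_suite(stub_model, GOLDEN_CASES, tmp)
    print("failed cases:", failed_ids)
```

The artifact carries the prompt, the expectation, and the actual output, so whoever opens the red build sees the failing example directly.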

Bad

The bad version runs a giant nondeterministic eval on every commit. It fails randomly, developers rerun until green, and soon the team learns the sacred ritual: ignore CI unless it fails twice. Very mature.

Ugly

The ugly reality is that some LLM behavior is inherently variable. Use tolerance bands, repeated sampling only where useful, and human review for ambiguous deltas. Do not pretend a flaky judge is a compiler.
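As a rough illustration, a tolerance band can turn a noisy score into three outcomes instead of two. Everything here is an assumption made for the sketch: the function names, the seeded sampling, and especially the rule that a drop of more than twice the tolerance is a hard fail while anything in between goes to a human.

```python
import random
import statistics


def sample_scores(score_once, n_samples, seed):
    """Call a noisy scorer n_samples times with a seeded RNG so reruns are reproducible."""
    rng = random.Random(seed)
    return [score_once(rng) for _ in range(n_samples)]


def classify_delta(baseline_mean, scores, tolerance):
    """Compare the sampled mean to a baseline using a tolerance band.

    Returns "pass" inside the band, "fail" for a drop larger than
    2 * tolerance (an arbitrary cutoff for this sketch), and
    "needs_review" for the ambiguous zone in between.
    """
    delta = statistics.mean(scores) - baseline_mean
    if delta >= -tolerance:
        return "pass"
    if delta < -2 * tolerance:
        return "fail"
    return "needs_review"
```

Only the "fail" outcome would be a candidate for a blocking gate; "needs_review" is where the human looks at the delta rather than CI pretending to know.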

Artifact to produce

Define CI policy: blocking checks, advisory checks, maximum runtime, retry policy, artifact capture, and flake owner.
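The policy can live as a small, reviewable data structure rather than tribal knowledge. The sketch below assumes one field per item in the list above; the class name `EvalCIPolicy`, its field names, and the validation rules are hypothetical choices, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class EvalCIPolicy:
    blocking_checks: list      # names of checks that can block a merge
    advisory_checks: list      # names of checks that only report
    max_runtime_minutes: int   # budget for the whole eval stage
    max_retries: int           # automatic retries allowed per check
    artifact_dir: str          # where failure artifacts are written
    flake_owner: str           # person or team responsible for flaky evals

    def problems(self):
        """Return a list of gaps in the policy (an empty list means complete)."""
        issues = []
        if not self.blocking_checks:
            issues.append("no blocking checks defined")
        overlap = set(self.blocking_checks) & set(self.advisory_checks)
        if overlap:
            issues.append(f"checks both blocking and advisory: {sorted(overlap)}")
        if self.max_runtime_minutes <= 0:
            issues.append("maximum runtime must be positive")
        if self.max_retries < 0:
            issues.append("retry count cannot be negative")
        if not self.artifact_dir:
            issues.append("no artifact location set")
        if not self.flake_owner.strip():
            issues.append("no flake owner assigned")
        return issues


# Example values are placeholders, not recommendations.
policy = EvalCIPolicy(
    blocking_checks=["output_is_valid_json", "refund_policy_golden"],
    advisory_checks=["tone_judge"],
    max_runtime_minutes=10,
    max_retries=1,
    artifact_dir="eval-artifacts/",
    flake_owner="eval-infra-team",
)
print(policy.problems())
```

Running `problems()` in CI itself would make an incomplete policy, such as a missing flake owner, fail loudly instead of silently.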

CI review

Question	Why it matters
Which checks block merges?	Blocking checks need high trust.
What failure artifact does CI preserve?	Developers need examples, not vibes in red text.
Who owns flaky evals?	Unowned flakes become ignored gates.

Chapter takeaway

A flaky eval in CI teaches developers one lesson: rerun until green. Stunning pedagogy, terrible guardrail.

References

EleutherAI lm-evaluation-harness
