Evals

Eval Writing Challenges

Validators are Python, local, and rude in the best way. Run a single challenge with npm run challenge -- <slug> --track eval-writing, or run every Eval Writing challenge with npm run challenge -- --all --track eval-writing. Do not run npm run challenge -- --all on its own: without --track it defaults to the MCP track, not Eval Writing.

Pairwise Harness

~35 min

Compare outputs pairwise without letting order bias quietly win.
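
One common way to keep order bias from deciding a comparison is to ask the judge twice with the outputs swapped, and only count a win when both orderings agree. The sketch below illustrates that idea; the judge signature and the names compare_pair, always_first, and longer_wins are assumptions for illustration, not the challenge's actual interface.

```python
from typing import Callable

# Hypothetical judge signature: given (first, second), return
# "first", "second", or "tie".
Judge = Callable[[str, str], str]


def compare_pair(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "tie". A winner must win in both orders."""
    forward = judge(a, b)  # a is shown first
    reverse = judge(b, a)  # b is shown first
    if forward == "first" and reverse == "second":
        return "a"
    if forward == "second" and reverse == "first":
        return "b"
    # Ties and order-dependent verdicts are both treated as no preference.
    return "tie"


def always_first(x: str, y: str) -> str:
    """A maximally position-biased toy judge."""
    return "first"


def longer_wins(x: str, y: str) -> str:
    """A toy judge whose verdict depends only on content."""
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"
```

With this scheme, a judge that always prefers whichever output it sees first never produces a winner, while a judge that looks only at content gives the same verdict in both orders.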

Regression Dataset Contract

~40 min

Validate eval examples, slices, and expected fields before CI trusts them.
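
A dataset contract of this kind usually comes down to a few mechanical checks run before any scoring happens. The sketch below assumes each example is a dict with id, input, expected, and slice keys and that slices come from a fixed set; those field names, the slice names, and validate_dataset itself are hypothetical, chosen only to show the shape of the check.

```python
# Hypothetical contract: field names and slice names are assumptions.
REQUIRED_FIELDS = {"id", "input", "expected", "slice"}
ALLOWED_SLICES = {"smoke", "edge", "regression"}


def validate_dataset(examples: list[dict]) -> list[str]:
    """Return a list of human-readable contract violations (empty if clean)."""
    errors = []
    seen_ids = set()
    for index, example in enumerate(examples):
        missing = REQUIRED_FIELDS - example.keys()
        if missing:
            errors.append(f"example {index}: missing fields {sorted(missing)}")
            continue  # remaining checks need every required field
        if example["id"] in seen_ids:
            errors.append(f"example {index}: duplicate id {example['id']!r}")
        seen_ids.add(example["id"])
        if example["slice"] not in ALLOWED_SLICES:
            errors.append(f"example {index}: unknown slice {example['slice']!r}")
    return errors
```

Returning every violation at once, rather than raising on the first, tends to make CI failures easier to fix in a single pass.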

Rollout Metrics

~45 min

Turn offline and online signals into a rollout recommendation.
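
As a rough illustration of combining the two kinds of signal, the sketch below gates on an offline pass rate first and then on how far the online error rate has moved from a baseline. The Signals fields, the thresholds, and the three recommendation strings are all assumptions made for the example, not values taken from the challenge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    # All three fields are hypothetical metric names.
    offline_pass_rate: float    # fraction of offline evals passed
    online_error_rate: float    # error rate observed for the candidate
    baseline_error_rate: float  # error rate of the current release


def recommend(
    signals: Signals,
    min_pass_rate: float = 0.95,         # assumed threshold
    max_error_regression: float = 0.01,  # assumed threshold
) -> str:
    """Return "hold", "rollback", or "proceed"."""
    # Offline gate: do not widen the rollout if offline evals look weak.
    if signals.offline_pass_rate < min_pass_rate:
        return "hold"
    # Online gate: compare the candidate against the baseline.
    regression = signals.online_error_rate - signals.baseline_error_rate
    if regression > max_error_regression:
        return "rollback"
    return "proceed"
```

Checking the offline gate first means a candidate with poor offline results is held regardless of how its early online numbers look.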

Rubric Conformance

~25 min

Write a deterministic rubric scorer that rejects ambiguous labels.
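
A minimal sketch of such a scorer, assuming a fixed label vocabulary: any label outside the vocabulary is rejected with an error rather than guessed at, so the same input always yields the same score or the same failure. The labels, their scores, and the name score_label are hypothetical.

```python
# Hypothetical rubric vocabulary and scores.
RUBRIC_SCORES = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def score_label(label: str) -> float:
    """Map a rubric label to a score, rejecting anything not in the rubric."""
    normalized = label.strip().lower()
    if normalized not in RUBRIC_SCORES:
        # No fuzzy matching: "pass/fail", "maybe", or "" all fail loudly.
        raise ValueError(f"ambiguous or unknown label: {label!r}")
    return RUBRIC_SCORES[normalized]
```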