Eval Writing Challenges
Validators are Python, local, and rude in the best way. Run a single challenge with npm run challenge -- <slug> --track eval-writing, or run every Eval Writing challenge with npm run challenge -- --all --track eval-writing. Do not run npm run challenge -- --all on its own: without --track it defaults to the MCP track, not Eval Writing.
mid
python
Pairwise Harness
~35 min
Compare outputs pairwise without letting order bias quietly win.
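One common way to keep order bias from winning is to judge each pair twice, once in each presentation order, and only accept a verdict that survives the swap. The sketch below assumes a hypothetical judge callable that returns "first", "second", or "tie"; the names compare_pair, always_first, and longer_wins are illustrative and not the challenge's actual interface.

```python
from typing import Callable

# Hypothetical judge signature: takes (shown_first, shown_second) and
# returns "first", "second", or "tie".
Judge = Callable[[str, str], str]


def compare_pair(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first

    # Map positional verdicts back to the underlying candidates.
    forward_winner = {"first": "a", "second": "b", "tie": "tie"}[forward]
    backward_winner = {"first": "b", "second": "a", "tie": "tie"}[backward]

    # A verdict that flips when the order flips is treated as a tie.
    return forward_winner if forward_winner == backward_winner else "tie"


def always_first(shown_first: str, shown_second: str) -> str:
    """A maximally order-biased toy judge."""
    return "first"


def longer_wins(shown_first: str, shown_second: str) -> str:
    """An order-independent toy judge that prefers the longer output."""
    if len(shown_first) > len(shown_second):
        return "first"
    if len(shown_second) > len(shown_first):
        return "second"
    return "tie"
```

With these toy judges, the biased one collapses to a tie on every pair, while the length-based one keeps its verdict regardless of order.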
boss
python
Regression Dataset Contract
~40 min
Validate eval examples, slices, and expected fields before CI trusts them.
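A dataset contract of this kind can be sketched as a function that collects every violation instead of stopping at the first one. The field names, types, and slice names below are assumptions made for illustration; the real contract is whatever the challenge defines.

```python
# Hypothetical contract: required fields with their types, and allowed slices.
REQUIRED_FIELDS = {"id": str, "input": str, "expected": str, "slice": str}
ALLOWED_SLICES = {"smoke", "edge", "regression"}


def validate_examples(examples: list[dict]) -> list[str]:
    """Return a list of human-readable contract violations (empty if clean)."""
    errors: list[str] = []
    seen_ids: set[str] = set()

    for index, example in enumerate(examples):
        # Every required field must be present and have the declared type.
        for field, field_type in REQUIRED_FIELDS.items():
            if field not in example:
                errors.append(f"example {index}: missing field '{field}'")
            elif not isinstance(example[field], field_type):
                errors.append(
                    f"example {index}: field '{field}' must be {field_type.__name__}"
                )

        # Slice names must come from the known set.
        slice_name = example.get("slice")
        if isinstance(slice_name, str) and slice_name not in ALLOWED_SLICES:
            errors.append(f"example {index}: unknown slice '{slice_name}'")

        # IDs must be unique across the dataset.
        example_id = example.get("id")
        if isinstance(example_id, str):
            if example_id in seen_ids:
                errors.append(f"example {index}: duplicate id '{example_id}'")
            seen_ids.add(example_id)

    return errors
```

A CI step could then fail the build whenever the returned list is non-empty and print each message.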
boss
python
Rollout Metrics
~45 min
Turn offline and online signals into a rollout recommendation.
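As a rough illustration of combining the two kinds of signal, the sketch below gates on an offline pass rate first and an online error rate second. The RolloutSignals fields, the thresholds, and the three verdict strings are all assumptions chosen for the example, not values taken from the challenge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutSignals:
    """Hypothetical inputs: offline eval pass rates and online error rates."""
    offline_pass_rate: float
    baseline_pass_rate: float
    online_error_rate: float
    baseline_error_rate: float


def recommend(
    signals: RolloutSignals,
    max_offline_drop: float = 0.01,     # assumed tolerance, not from the source
    max_error_increase: float = 0.005,  # assumed tolerance, not from the source
) -> str:
    """Return "hold", "rollback", or "proceed" from the combined signals."""
    offline_drop = signals.baseline_pass_rate - signals.offline_pass_rate
    error_increase = signals.online_error_rate - signals.baseline_error_rate

    # An offline regression blocks the rollout before it reaches more users.
    if offline_drop > max_offline_drop:
        return "hold"
    # An online error spike means the change is already hurting traffic.
    if error_increase > max_error_increase:
        return "rollback"
    return "proceed"
```

Checking the offline gate first is a design choice in this sketch: it keeps a known-bad candidate from being judged on thin online data.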
warmup
python
Rubric Conformance
~25 min
Write a deterministic rubric scorer that rejects ambiguous labels.
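A deterministic scorer in this spirit can be a plain lookup that refuses anything outside a closed label set rather than guessing. The rubric labels and scores below are hypothetical, as are the function names.

```python
# Hypothetical rubric: a closed set of labels with fixed scores.
RUBRIC = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def score_label(raw_label: str) -> float:
    """Score one label, raising ValueError on anything outside the rubric."""
    label = raw_label.strip().lower()
    if label not in RUBRIC:
        # Compound or misspelled labels are rejected instead of interpreted.
        raise ValueError(f"ambiguous or unknown label: {raw_label!r}")
    return RUBRIC[label]


def score_labels(raw_labels: list[str]) -> float:
    """Average the scores of a non-empty list of labels."""
    if not raw_labels:
        raise ValueError("no labels to score")
    return sum(score_label(label) for label in raw_labels) / len(raw_labels)
```

Because the scorer only normalizes case and surrounding whitespace, a label such as "pass/partial" raises an error instead of silently landing on one side.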