Archive
Evaluation & Benchmarks

The Paper That Argued Against Four-Choice Tests

Expert
300 pts, 0 solves
Sclar et al. (2023) showed that changing the letter ordering, option separators, or answer prompt on MMLU can swing a model's score by 10+ points. Name the phenomenon with a two-word phrase. Flag format: CONGRESS{two-words}. Example: CONGRESS{prompt fragility}.
Hint: How much the score depends on the shape of the test.
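
To make the effect in the challenge text concrete, here is a minimal sketch of how one might render a single four-choice question under several surface formats and measure the resulting score swing. The format lists and the names render, all_formats, and score_spread are hypothetical, chosen for illustration only; they are not the grid or the code used by Sclar et al., and the accuracies would have to come from your own model runs.

```python
from itertools import product

# Hypothetical format choices for illustration; not the grid from the paper.
SEPARATORS = [". ", ") ", ": "]
LABEL_SETS = [("A", "B", "C", "D"), ("1", "2", "3", "4")]
ANSWER_PROMPTS = ["Answer:", "The correct option is"]


def render(question, options, labels, separator, answer_prompt):
    """Render one multiple-choice item under a given surface format."""
    lines = [question]
    lines += [f"{label}{separator}{text}" for label, text in zip(labels, options)]
    lines.append(answer_prompt)
    return "\n".join(lines)


def all_formats():
    """Enumerate every (labels, separator, answer_prompt) combination."""
    return list(product(LABEL_SETS, SEPARATORS, ANSWER_PROMPTS))


def score_spread(accuracy_by_format):
    """Return max minus min accuracy, in percentage points, across formats."""
    values = list(accuracy_by_format.values())
    return max(values) - min(values)
```

Evaluating the same question set once per entry of all_formats() and passing the per-format accuracies to score_spread gives the kind of swing the challenge describes. Only the presentation of each item changes; the questions and answer options stay the same.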

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.