The Paper That Argued Against Four-Choice Tests
Archive · Expert
Sclar et al. (2023) showed that swapping the letter ordering, option separators, or answer prompt on MMLU can swing a model's score by 10+ points, naming the phenomenon in the process. The answer is a two-word phrase. Flag format: CONGRESS{two-words}. Example: CONGRESS{prompt fragility}.
Hint: How much the score depends on the shape of the test.
Archive — no submissions accepted
This challenge is preserved for reference. Play live challenges at /challenges.