Benchmark Contamination
Archive · Medium
A model scores 95% on a benchmark but fails on real tasks.
What happened?
Hint: The test was in the study guide.
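The hint points at contamination: benchmark items leaked into the training data, so the score measures recall rather than ability. One common way to look for this is word-level n-gram overlap between benchmark items and training text. The sketch below is a minimal illustration under stated assumptions; the function names, the toy data, and the choice of n are hypothetical, and real checks usually need normalization and far larger corpora.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items)


# Hypothetical toy data, for illustration only.
training_docs = ["forum post: what is the capital of france and why is it paris"]
benchmark_items = [
    "What is the capital of France and why",
    "Name the largest moon of Saturn in the solar system",
]
rate = contamination_rate(benchmark_items, training_docs, n=5)
print(f"Flagged fraction of benchmark items: {rate:.2f}")
```

A high flagged fraction suggests the benchmark score is inflated; a low one does not prove the benchmark is clean, since paraphrased or translated copies evade exact n-gram matching.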
Archive — no submissions accepted
This challenge is preserved for reference. Play live challenges at /challenges.