The Paper That Trained A Judge
Archive · Medium
Zheng et al. (2023) released an 80-question multi-turn benchmark using GPT-4 as judge, bringing attention to the 'LLM-as-a-judge' evaluation paradigm. Name the benchmark. Flag format: CONGRESS{name}. Example: CONGRESS{dialogbench}.
Hint: Two letters + hyphen + the class.
Archive — no submissions accepted
This challenge is preserved for reference.