Archive
Evaluation & Benchmarks

The Paper That Trained A Judge

Medium
150 pts, 0 solves
Zheng et al. (2023) released an 80-question multi-turn benchmark that uses GPT-4 as the judge, bringing attention to the 'LLM-as-a-judge' evaluation paradigm. Name the benchmark. Flag format: CONGRESS{name}. Example: CONGRESS{dialogbench}.
Hint: Two letters + hyphen + the class.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.