Evaluation & Benchmarks

The Paper That Trained A Judge

Archive

Medium

150 pts · 0 solves

Zheng et al. (2023) released an 80-question multi-turn benchmark using GPT-4 as judge, bringing attention to the 'LLM-as-a-judge' evaluation paradigm. Name the benchmark. Flag format: CONGRESS{name}. Example: CONGRESS{dialogbench}.
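As a rough illustration of the flag format stated above, a candidate answer could be checked for shape before submitting. This is only a sketch: the function name `is_well_formed_flag` is hypothetical, and the allowed character set inside the braces (letters, digits, hyphen, underscore) is an assumption, since the challenge specifies only `CONGRESS{name}`. It checks the wrapper, not whether the name is correct.

```python
import re

# Assumed shape: the literal prefix CONGRESS followed by a non-empty name in braces.
# The character class below is a guess; the challenge text does not define it.
FLAG_PATTERN = re.compile(r"CONGRESS\{[A-Za-z0-9_-]+\}")


def is_well_formed_flag(candidate: str) -> bool:
    """Return True if the candidate matches the assumed CONGRESS{name} shape."""
    return FLAG_PATTERN.fullmatch(candidate.strip()) is not None


if __name__ == "__main__":
    # The example flag from the challenge text has the right shape.
    print(is_well_formed_flag("CONGRESS{dialogbench}"))
    # A bare name without the wrapper does not.
    print(is_well_formed_flag("dialogbench"))
```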

Hint

Two letters + hyphen + the class.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.