Agent Evaluation
Archive | Hard
Evaluating agents is different from evaluating LLMs. Output quality isn't enough.
What should agent evals primarily measure?
Hint: Did the task get done? Not just 'was the text pretty?'
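The hint's point can be sketched in code. The following is a minimal illustration, not the API of any real eval framework: `Task`, `RunResult`, `task_success_rate`, and the two toy agents are all hypothetical names, and the "environment" is just a dict standing in for a file system. The scorer checks the end state the agent leaves behind and deliberately ignores what the agent says.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical, chosen for illustration only.
State = Dict[str, str]  # toy environment: filename -> contents


@dataclass
class RunResult:
    final_state: State  # what the environment looks like after the run
    transcript: str     # what the agent said; the scorer never reads this


@dataclass
class Task:
    name: str
    initial_state: State
    is_done: Callable[[State], bool]  # success check on the end state


Agent = Callable[[State], RunResult]


def task_success_rate(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks whose end-state check passes."""
    if not tasks:
        return 0.0
    passed = 0
    for task in tasks:
        # Give each run its own copy so tasks cannot affect each other.
        result = agent(dict(task.initial_state))
        if task.is_done(result.final_state):
            passed += 1
    return passed / len(tasks)


def talker(state: State) -> RunResult:
    """Writes a polished answer but never touches the environment."""
    return RunResult(state, "I have carefully renamed draft.txt to final.txt.")


def doer(state: State) -> RunResult:
    """Terse, but actually performs the rename."""
    state["final.txt"] = state.pop("draft.txt")
    return RunResult(state, "done")


TASKS = [
    Task(
        name="rename draft.txt to final.txt",
        initial_state={"draft.txt": "hello"},
        is_done=lambda s: "final.txt" in s and "draft.txt" not in s,
    )
]

if __name__ == "__main__":
    print("talker:", task_success_rate(talker, TASKS))
    print("doer:", task_success_rate(doer, TASKS))
```

Under these assumptions, the agent with the nicer transcript fails and the terse one passes, because only the final state is scored. A real harness would likely also track cost, step count, and side effects, but task completion remains the primary signal.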
Archive — no submissions accepted
This challenge is preserved for reference.