Archive
Agentic Architectures

Agent Evaluation

Hard
200 pts, 40 solves
Evaluating agents is different from evaluating LLMs. Output quality isn't enough. What should agent evals primarily measure?
Hint: Did the task get done? Not just 'was the text pretty?'
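One way to read the hint is that an agent eval scores the end state of the environment rather than the wording of the agent's reply. The sketch below is a minimal illustration under that assumption. Every name in it (`Task`, `success_rate`, `working_agent`, `talkative_agent`) is hypothetical and not part of the challenge, and a plain dictionary stands in for a real environment such as a file system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical stand-in for an environment (file name -> contents)


@dataclass
class Task:
    name: str
    initial_state: State
    # The check inspects the final state only; it never looks at the agent's reply text.
    check: Callable[[State], bool]


def success_rate(agent: Callable[[State], str], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose final state passes the task's check."""
    passed = 0
    for task in tasks:
        state = dict(task.initial_state)  # fresh copy so tasks do not affect each other
        agent(state)                      # the reply text is deliberately ignored
        if task.check(state):
            passed += 1
    return passed / len(tasks)


def working_agent(state: State) -> str:
    # Actually performs the rename, with a terse reply.
    state["report.txt"] = state.pop("draft.txt")
    return "Renamed."


def talkative_agent(state: State) -> str:
    # Produces a polished reply but never changes the environment.
    return "I have carefully renamed draft.txt to report.txt for you!"


TASKS = [
    Task(
        name="rename draft.txt to report.txt",
        initial_state={"draft.txt": "hello"},
        check=lambda s: s.get("report.txt") == "hello" and "draft.txt" not in s,
    )
]

if __name__ == "__main__":
    print("working agent:", success_rate(working_agent, TASKS))
    print("talkative agent:", success_rate(talkative_agent, TASKS))
```

In this toy setup the agent with the nicer-sounding reply scores lower, because the check depends only on whether the rename happened. A fuller eval would likely also track cost, number of steps, and side effects, but those are outside this sketch.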

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.