Agentic Architectures

Agent Evaluation

Archive

Hard

200 pts · 40 solves

Evaluating agents is different from evaluating LLMs. Output quality isn't enough. What should agent evals primarily measure?

Hint

Did the task get done? Not just 'was the text pretty?'
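To make the hint concrete, here is a minimal sketch of a task-completion metric. It is an illustration under stated assumptions, not a reference implementation: the `Episode` record, the `checkers` mapping, and the task names are all hypothetical. The idea is that each task has a programmatic check on the final environment state, and the agent's reply text is deliberately ignored when scoring.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Episode:
    """One agent run (hypothetical record): final environment state plus the agent's reply."""
    task_id: str
    final_state: Dict[str, object]
    reply: str


# A checker inspects the final state and returns True if the task goal was met.
Checker = Callable[[Dict[str, object]], bool]


def task_success_rate(episodes: List[Episode], checkers: Dict[str, Checker]) -> float:
    """Fraction of episodes whose final state satisfies the checker for that task."""
    if not episodes:
        return 0.0
    passed = sum(1 for ep in episodes if checkers[ep.task_id](ep.final_state))
    return passed / len(episodes)


# Hypothetical tasks and checks, for illustration only.
checkers: Dict[str, Checker] = {
    "create_file": lambda state: state.get("file_exists") is True,
    "book_meeting": lambda state: state.get("meeting_booked") is True,
}

episodes = [
    # Terse reply, but the task was actually completed.
    Episode("create_file", {"file_exists": True}, "Done."),
    # Polished reply, but nothing was booked: this counts as a failure.
    Episode("book_meeting", {"meeting_booked": False},
            "I have scheduled your meeting for 3pm, as requested!"),
]

rate = task_success_rate(episodes, checkers)
print(rate)  # prints 0.5
```

The second episode shows why output quality alone can mislead: the reply reads well, yet the state check fails, so the episode does not count toward the success rate.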

Archive — no submissions accepted

This challenge is preserved for reference.