Agent Evaluation

Hard · 200 pts · 0 solves
Evaluating agents is different from evaluating LLMs: output quality alone isn't enough. You need to verify that the agent actually accomplished its goal. What should agent evals primarily measure?
Flag format: CONGRESS{[primary_metric]}
Example: CONGRESS{response_fluency}
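The distinction above can be sketched in code. This is a minimal, hypothetical eval harness (the task shape, `run_agent`, and `evaluate` are illustrative names, not any real framework): rather than scoring the agent's text, it checks whether the environment ended in the goal state.

```python
# Minimal sketch of an outcome-based agent eval (all names are hypothetical).
# The key idea: judge the end state of the environment, not the prose.

def run_agent(task):
    # Stand-in for a real agent that would call tools, edit files, etc.
    # Here it simply mutates the task's state to simulate doing the work.
    task["state"]["result"] = task["expected"]
    return "I have finished the task."  # fluent text alone proves nothing

def evaluate(task):
    transcript = run_agent(task)
    # Outcome check: did the environment reach the goal state?
    completed = task["state"].get("result") == task["expected"]
    return {"task_completed": completed, "transcript": transcript}

task = {"state": {}, "expected": 42}
print(evaluate(task)["task_completed"])  # True
```

A fluency-only eval would accept any confident-sounding transcript; this harness passes only if the checkable outcome is present.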
Hint
Did the task get done? Not just 'was the text pretty?'