Agent Evaluation

Hard · 200 pts · 0 solves
Evaluating agents is different from evaluating LLMs: output quality alone isn't enough. You need to verify that the agent actually accomplished its goal. What should agent evals primarily measure?
Flag format: CONGRESS{[primary_metric]}
Example: CONGRESS{response_fluency}
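The distinction above can be sketched in code. This is a minimal, hypothetical eval harness (the task shape, `run_agent`, and `evaluate` are illustrative names, not any real framework): rather than scoring the agent's text, it checks whether the environment ended in the goal state.

```python
# Minimal sketch of an outcome-based agent eval (all names are hypothetical).
# The key idea: judge the end state of the environment, not the prose.

def run_agent(task):
    # Stand-in for a real agent that would call tools, edit files, etc.
    # Here it simply mutates the task's state to simulate doing the work.
    task["state"]["result"] = task["expected"]
    return "I have finished the task."  # fluent text alone proves nothing

def evaluate(task):
    transcript = run_agent(task)
    # Outcome check: did the environment reach the goal state?
    completed = task["state"].get("result") == task["expected"]
    return {"task_completed": completed, "transcript": transcript}

task = {"state": {}, "expected": 42}
print(evaluate(task)["task_completed"])  # True
```

A fluency-only eval would accept any confident-sounding transcript; this harness passes only if the checkable outcome is present.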
Hint
Did the task get done? Not just 'was the text pretty?'