Archive
Evaluation & Benchmarks

The 2024 Agent Benchmark Of Web Tasks

Hard
200 pts, 0 solves
Zhou et al. (2023-24) released a realistic, self-hostable collection of websites (e-commerce, forums, wikis, code repos) with 800+ tasks to evaluate web-acting agents. Name the benchmark. Flag format: CONGRESS{name}. Example: CONGRESS{browsenet}.
Hint: The environment type + the fighting space from Rome.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.