The 2024 Agent Benchmark of Web Tasks
Archive | Hard
Zhou et al. (2023-24) released a realistic, self-hostable collection of websites (e-commerce, forums, wikis, code repos) with 800+ tasks to evaluate web-acting agents. Name the benchmark. Flag format: CONGRESS{name}. Example: CONGRESS{browsenet}.
Hint: The environment type + the fighting space from Rome.
Archive — no submissions accepted
This challenge is preserved for reference. Play live challenges at /challenges.