Evaluation & Benchmarks

The 2024 Agent Benchmark Of Web Tasks

Archive

Hard

200 pts, 0 solves

Zhou et al. (2023-24) released a realistic, self-hostable collection of websites (e-commerce, forums, wikis, code repos) with 800+ tasks to evaluate web-acting agents. Name the benchmark. Flag format: CONGRESS{name}. Example: CONGRESS{browsenet}.
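As a minimal sketch of the flag format above, the following checks that a submission has the shape CONGRESS{name} and extracts the name. The allowed character set (lowercase letters, digits, underscores) is an assumption inferred from the single example on this page, and `parse_flag` is a hypothetical helper, not part of the challenge platform.

```python
import re
from typing import Optional

# Assumption: names use lowercase letters, digits, or underscores.
# The page only shows one lowercase example, so this may be too strict.
FLAG_PATTERN = re.compile(r"CONGRESS\{([a-z0-9_]+)\}")


def parse_flag(flag: str) -> Optional[str]:
    """Return the name inside a well-formed flag, or None if it does not match."""
    match = FLAG_PATTERN.fullmatch(flag.strip())
    return match.group(1) if match else None
```

For the example given on this page, `parse_flag("CONGRESS{browsenet}")` yields the name `browsenet`, while a string with the wrong prefix case or empty braces yields `None`.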

Hint

The environment type + the fighting space from Rome.

Archive — no submissions accepted

This challenge is preserved for reference.