Evaluation & Benchmarks

Three Rings Of METR

Archive

Expert

300 pts · 0 solves

METR's 2023 dangerous-capability evaluation framework benchmarks an LLM agent on three capability classes: _____(1) task completion, _____(2) of itself onto another machine, and _____(3) to novel environments. Fill in the three blanks. Flag format: CONGRESS{1:[word],2:[word],3:[word]}. Example: CONGRESS{1:autonomous,2:replication,3:adaptation}.
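To check that a candidate answer matches the flag format before submitting, a small helper like the one below can build and parse the string. This is only a sketch: the function names `build_flag` and `parse_flag` are hypothetical, and the rule that each [word] is a single run of ASCII letters is an assumption, since the challenge text does not define what counts as a word.

```python
import re

# Assumption: each [word] is one run of ASCII letters with no spaces,
# digits, or punctuation. The challenge text does not spell this out.
FLAG_PATTERN = re.compile(
    r"CONGRESS\{1:([A-Za-z]+),2:([A-Za-z]+),3:([A-Za-z]+)\}"
)


def build_flag(first, second, third):
    """Assemble a flag string from the three blank-filling words."""
    return f"CONGRESS{{1:{first},2:{second},3:{third}}}"


def parse_flag(flag):
    """Return the three words as a tuple, or None if the format is wrong."""
    match = FLAG_PATTERN.fullmatch(flag)
    return match.groups() if match else None


if __name__ == "__main__":
    # Uses the example words given in the challenge text.
    flag = build_flag("autonomous", "replication", "adaptation")
    print(flag)
    print(parse_flag(flag))
```

Using `fullmatch` rather than `search` rejects strings with stray characters before or after the braces, which is the stricter reading of the stated format.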

Hint: One tests acting alone, one tests spreading, one tests flexibility.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.