Archive
Evaluation & Benchmarks

Vibe Check Problem

Very Easy
50 pts · 40 solves
You test your LLM app by chatting with it and checking if it 'feels right.' This doesn't scale. What replaces manual testing?
Hint: A reproducible set of tests that run automatically.
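As a rough illustration of the hint, here is a minimal sketch of an automated evaluation set in Python. Everything in it is an assumption made for illustration: `fake_app` is a hypothetical stand-in for the real LLM app, the three cases are invented, and the substring check is only one simple way to score an answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # substring the answer is expected to include


def fake_app(prompt: str) -> str:
    """Hypothetical stand-in for the real LLM app; swap in your own call."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(prompt, "I don't know.")


# A fixed, version-controlled list of cases is what makes the run reproducible.
CASES: List[EvalCase] = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Who wrote Hamlet?", "Shakespeare"),
]


def run_eval(
    app: Callable[[str], str], cases: List[EvalCase]
) -> List[Tuple[EvalCase, bool]]:
    """Run every case through the app and record pass/fail per case."""
    return [
        (case, case.must_contain.lower() in app(case.prompt).lower())
        for case in cases
    ]


def pass_rate(results: List[Tuple[EvalCase, bool]]) -> float:
    """Fraction of cases that passed."""
    return sum(passed for _, passed in results) / len(results)


if __name__ == "__main__":
    results = run_eval(fake_app, CASES)
    for case, passed in results:
        print(f"{'PASS' if passed else 'FAIL'}  {case.prompt}")
    print(f"pass rate: {pass_rate(results):.0%}")
```

Running the same cases after every prompt or model change gives a number that can be compared across versions, which chatting with the app by hand cannot provide.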

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.