Vibe Check Problem

Very Easy | 50 pts | 0 solves
You test your LLM app by chatting with it and checking if it 'feels right.' This doesn't scale. What replaces manual testing?

Flag format: CONGRESS{[replacement]}
Example: CONGRESS{larger_model}
Hint
A reproducible set of tests that run automatically.
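
One way to picture what the hint describes is a fixed list of prompts with expected properties, scored by a script instead of by eye. The sketch below is only an illustration: `fake_llm_app`, `Case`, `run_suite`, and the sample prompts are all hypothetical names invented here, and the substring check stands in for whatever scoring a real app would need.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    prompt: str
    must_contain: str  # substring the answer is expected to include


def fake_llm_app(prompt: str) -> str:
    """Hypothetical stand-in for the real app; canned answers keep runs reproducible."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(prompt, "I don't know.")


def run_suite(app: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases whose output contains the expected substring."""
    passed = sum(
        1 for c in cases if c.must_contain.lower() in app(c.prompt).lower()
    )
    return passed / len(cases)


# Assumed sample cases; a real suite would cover the app's actual use cases.
CASES = [
    Case("What is the capital of France?", "Paris"),
    Case("What is 2 + 2?", "4"),
    Case("Who wrote Hamlet?", "Shakespeare"),
]

if __name__ == "__main__":
    score = run_suite(fake_llm_app, CASES)
    print(f"pass rate: {score:.2f}")
```

Because the cases and the scoring rule are fixed, the same script can be rerun after every prompt or model change and the pass rates compared, which manual chatting cannot offer.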