Human Preference Prediction
Hard · 200 pts · 0 solves
RLHF's reward model is trained on pairwise human comparisons, so it never learns correctness directly. What does it actually predict?
Flag format: CONGRESS{[prediction]}
Example: CONGRESS{factual_accuracy_score}
Hint
Given two outputs, which one would a human choose?
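The hint points at the standard training setup: given two candidate outputs, the reward model is fit to predict which one a human annotator chose, typically with a Bradley-Terry pairwise loss. A minimal sketch of that objective (function names are illustrative, not from any particular library):

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry model: probability that a human prefers the first output,
    given the scalar rewards the model assigns to each output."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-chosen output wins the comparison.
    Training minimizes this, pushing the chosen output's reward above the rejected one's."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

When the two rewards are equal, the predicted preference is 0.5 and the loss is ln 2; the loss shrinks only as the chosen output's reward pulls ahead. Nothing in this objective references factual accuracy, only which output a human would pick.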