Human Preference Prediction
Hard · 200 pts · 0 solves
RLHF's reward model is trained on pairwise human comparisons, so it never learns correctness directly. What does it actually predict?
Flag format: CONGRESS{[prediction]}
Example: CONGRESS{factual_accuracy_score}
Hint
Given two outputs, which one would a human choose?
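The hint points at the standard training setup: given two candidate outputs, the reward model is fit to predict which one a human annotator chose, typically with a Bradley-Terry pairwise loss. A minimal sketch of that objective (function names are illustrative, not from any particular library):

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry model: probability that a human prefers the first output,
    given the scalar rewards the model assigns to each output."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-chosen output wins the comparison.
    Training minimizes this, pushing the chosen output's reward above the rejected one's."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

When the two rewards are equal, the predicted preference is 0.5 and the loss is ln 2; the loss shrinks only as the chosen output's reward pulls ahead. Nothing in this objective references factual accuracy, only which output a human would pick.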