DPO vs RLHF

Hard · 200 pts · 0 solves
DPO (Direct Preference Optimization) achieves results comparable to RLHF while eliminating the separate reward model entirely: it optimizes the policy directly from preference pairs. What is DPO's key advantage over RLHF? Flag format: CONGRESS{advantage_in_snake_case}
Hint
No separate reward model means simpler training pipeline.
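The simplification the hint points at can be sketched as the DPO loss itself, which needs only per-completion log-probabilities from the policy and a frozen reference model, with no reward model in the loop. A minimal sketch (function name and the β value are illustrative, not from the challenge):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward for each completion: the log-ratio between the
    # policy and the frozen reference model (no learned reward model).
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # DPO objective: -log sigmoid(beta * (chosen_reward - rejected_reward))
    margin = beta * (chosen_reward - rejected_reward)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is lower when the policy prefers the chosen completion more
# strongly than the reference does.
low = dpo_loss(-1.0, -2.0, -1.5, -1.5)   # policy favors chosen
high = dpo_loss(-2.0, -1.0, -1.5, -1.5)  # policy favors rejected
```

Because the loss is a simple binary-classification-style objective over preference pairs, the whole RLHF stage of fitting a reward model and then running PPO collapses into one supervised training loop.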