DPO

Hard · 200 pts · 0 solves
DPO (Direct Preference Optimization) achieves RLHF-like results but skips training a separate reward model. What is DPO's key simplification?
Flag format: CONGRESS{[simplification]}
Example: CONGRESS{no_human_data_needed}
Hint
One fewer model to train. Preferences are used directly.