DPO
Hard · 200 pts · 0 solves
DPO (Direct Preference Optimization) achieves RLHF-like results but skips training a separate reward model.
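To make the idea concrete, here is a minimal sketch of the per-example DPO loss. The function names, the β value, and the scalar log-probabilities are illustrative assumptions, not part of the challenge; the point is that preferences are scored directly from policy and reference log-probs, with no separately trained reward model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss (illustrative sketch).

    The policy's implicit reward for a completion is
    beta * (policy log-prob - reference log-prob); the loss is a
    Bradley-Terry preference loss on those implicit rewards.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): small when the chosen completion is
    # preferred by a wide margin, log(2) when the model is indifferent.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probs everywhere the margin is zero and the loss is log 2; as the policy raises the chosen completion's log-prob relative to the reference, the loss falls.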
What is DPO's key simplification?
Flag format: CONGRESS{[simplification]}
Example: CONGRESS{no_human_data_needed}
Hint
One fewer model to train. Preferences are used directly.