DPO
Hard · 200 pts · 0 solves
DPO (Direct Preference Optimization) achieves RLHF-like results but skips training a separate reward model.
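To make the idea concrete, here is a minimal sketch of the per-example DPO loss. The function names, the β value, and the scalar log-probabilities are illustrative assumptions, not part of the challenge; the point is that preferences are scored directly from policy and reference log-probs, with no separately trained reward model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss (illustrative sketch).

    The policy's implicit reward for a completion is
    beta * (policy log-prob - reference log-prob); the loss is a
    Bradley-Terry preference loss on those implicit rewards.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): small when the chosen completion is
    # preferred by a wide margin, log(2) when the model is indifferent.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probs everywhere the margin is zero and the loss is log 2; as the policy raises the chosen completion's log-prob relative to the reference, the loss falls.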
What is DPO's key simplification?
Flag format: CONGRESS{[simplification]}
Example: CONGRESS{no_human_data_needed}
Hint
One fewer model to train. Preferences are used directly.