Archive
Fine-Tuning & Training

The Contrastive Cousin Of RLHF

Very Easy
50 pts, 0 solves
Rafailov et al. (2023) replaced the reward model + PPO dance of RLHF with a single classification-style loss over preference pairs. What is the method's three-letter acronym?
Hint:
First two letters = 'Direct', last = what RLHF does to the policy.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.