Archive
Fine-Tuning & Training

The Paper That Re-Finetunes The Reward

Hard
200 pts, 0 solves
Azar et al. (2023) addressed the DPO reward-model-gap issue by modifying the loss so that perfectly separable preference pairs do not inflate the logit gap without bound. The answer is the three-letter acronym of their method. Flag format: CONGRESS{acronym}. Example: CONGRESS{dpo}.
Hint: First letter = 'identity'.
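
The failure mode in the challenge text can be illustrated numerically. The sketch below is not taken from the challenge or from any library. It assumes the "logit gap" is a scalar margin: the policy-to-reference log-ratio of the chosen response minus that of the rejected one. It compares the standard DPO log-sigmoid loss with a squared-error loss that pulls the margin toward a fixed target of 1/(2*tau), which is the general shape of the fix. The function names and the beta and tau values are hypothetical, chosen only for illustration.

```python
import math


def dpo_loss(margin, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)), computed stably."""
    x = beta * margin
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def squared_margin_loss(margin, tau=0.1):
    """Squared error between the margin and a fixed target 1 / (2 * tau).

    Illustrative stand-in for a bounded-margin objective (assumption:
    tau plays the role of the regularization strength).
    """
    target = 1.0 / (2.0 * tau)
    return (margin - target) ** 2


# Hypothetical margins for a pair the policy already separates.
margins = [0.0, 5.0, 10.0, 50.0, 100.0]
dpo_values = [dpo_loss(m) for m in margins]
sq_values = [squared_margin_loss(m) for m in margins]

for m, d, s in zip(margins, dpo_values, sq_values):
    print(f"margin={m:6.1f}  dpo={d:.6f}  squared={s:.2f}")
```

Under these assumptions the DPO loss keeps shrinking as the margin grows, so a separable pair always rewards a larger gap. The squared loss reaches its minimum at the target margin and penalizes overshooting it.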

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.