The Model Trained From Preferences
Archive · Very Easy
The training technique that aligns a model with human preferences by combining a reward model with a policy-gradient loop is known by a four-letter acronym. Flag format: CONGRESS{acronym}. Example: CONGRESS{sft}.
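As a rough illustration of the flag format, the sketch below wraps an acronym in the CONGRESS{...} envelope and checks a candidate's shape. The helper names make_flag and looks_like_flag are hypothetical, and the lowercase-letters-only rule is an assumption inferred from the single example above, not something the challenge states.

```python
import re

# Assumption: flags contain only lowercase letters between the braces,
# as in the documented example CONGRESS{sft}.
FLAG_PATTERN = re.compile(r"CONGRESS\{[a-z]+\}")


def make_flag(acronym: str) -> str:
    """Wrap an acronym in the flag envelope (hypothetical helper)."""
    return f"CONGRESS{{{acronym.strip().lower()}}}"


def looks_like_flag(candidate: str) -> bool:
    """Return True if the candidate has the assumed flag shape."""
    return FLAG_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    # Uses the example acronym from the challenge text, not the answer.
    print(make_flag("SFT"))
    print(looks_like_flag("CONGRESS{sft}"))
```

This only checks the shape of a submission; whether a given acronym is the correct answer is decided by the challenge itself.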
Hint: Reinforcement Learning from Human ...
Archive — no submissions accepted
This challenge is preserved for reference. Play live challenges at /challenges.