Fine-Tuning & Training

The Model Trained From Preferences

Archive

Very Easy

50 pts · 0 solves

The training technique that aligns a model with human preferences, using a reward model and a policy-gradient loop, is known by a four-letter acronym. Flag format: CONGRESS{acronym}. Example: CONGRESS{sft}.
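For readers unfamiliar with the loop the description mentions, here is a minimal sketch of a reward model scoring sampled outputs and a policy-gradient update nudging the policy toward higher-scoring ones. Everything in it is an illustrative assumption: `toy_reward_model` is a hypothetical hard-coded stand-in for a reward model learned from human preference data, the "policy" is a two-way softmax rather than a language model, and the update is plain REINFORCE with a running baseline, not the specific algorithm any particular system uses.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward_model(response_index):
    """Hypothetical reward model: pretends human raters prefer response 1."""
    return 1.0 if response_index == 1 else 0.0


def train_policy(steps=2000, lr=0.1, seed=0):
    """Run a REINFORCE loop over two candidate responses; return final probs."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]   # policy parameters, one logit per response
    baseline = 0.0        # running average of reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a response from the current policy.
        action = 0 if rng.random() < probs[0] else 1
        # Score the sampled response with the reward model.
        reward = toy_reward_model(action)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - probs[k].
        for k in range(2):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    print(train_policy())
```

Under these assumptions, the policy shifts most of its probability mass onto the response the toy reward model prefers. Real systems typically add further pieces, such as a penalty for drifting too far from the starting model, which this sketch omits.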

Hint: Reinforcement Learning from Human ...

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.