RLHF
Medium | 150 pts | 0 solves
RLHF (Reinforcement Learning from Human Feedback) works in 3 stages:
1. Supervised fine-tuning
2. Train a ______ model on human preferences
3. Optimize the LLM with RL using that model
What guides the RL optimization?
Flag format: CONGRESS{guide_in_snake_case}
Hint
Human preferences are encoded into a model that scores outputs.
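As a rough illustration of the hint, the sketch below shows one common way a scoring model can be fit to pairwise human preferences: a logistic (Bradley-Terry style) loss on the difference between the two scores. The function name and the example scores are hypothetical and not part of the challenge, and this is an assumption about a typical setup rather than the method any particular system uses.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred output outranks the other.

    Both arguments are scalar scores that a (hypothetical) scoring model
    assigned to two candidate outputs for the same prompt.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch on the sign
    # of the margin so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up scores: the loss is small when the model agrees with the human
# ranking and large when it disagrees.
loss_when_agreeing = pairwise_preference_loss(2.0, 0.0)
loss_when_disagreeing = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many labeled pairs pushes the model to score preferred outputs higher, and those scores are then what the RL stage tries to maximize.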