RLHF

Medium, 150 pts, 0 solves
RLHF (Reinforcement Learning from Human Feedback) works in 3 stages:
1. Supervised fine-tuning
2. Train a ______ model on human preferences
3. Optimize the LLM with RL using that model

What guides the RL optimization?

Flag format: CONGRESS{guide_in_snake_case}
Hint
Human preferences are encoded into a model that scores outputs.
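
To make the hint concrete, here is a minimal sketch of how pairwise human preferences are commonly turned into a training signal for a scoring model, assuming the standard Bradley-Terry style objective. The function name and the example scores are hypothetical and not part of the challenge. The scores would come from the stage 2 model, which is deliberately left unnamed here.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred output wins.

    Assumes a Bradley-Terry style model:
    P(chosen beats rejected) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores for two candidate outputs to the same prompt.
agrees_with_human = pairwise_preference_loss(2.0, 0.0)
disagrees_with_human = pairwise_preference_loss(0.0, 2.0)
print(agrees_with_human < disagrees_with_human)
```

The loss is small when the scoring model ranks the preferred output higher and large when it ranks it lower, so minimizing it pushes the scores toward the human ordering. Stage 3 then optimizes the LLM against those scores.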