RLHF Pipeline

Medium | 150 pts | 0 solves

RLHF (reinforcement learning from human feedback) has three stages: supervised fine-tuning, training a model from human preferences, and then optimizing with reinforcement learning. Name all three stages.

Flag format: CONGRESS{[stage1],[stage2],[stage3]}

Example: CONGRESS{pretrain,finetune,deploy}
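To sanity-check the shape of a flag before submitting, the format above can be sketched as a small parser. This is a minimal sketch, not the challenge's actual checker: the names FLAG_PATTERN, parse_flag, and build_flag are hypothetical, and it assumes each stage name is non-empty and contains no commas or braces. Whether the real checker is case-sensitive or tolerates spaces is not stated.

```python
import re

# Hypothetical pattern: exactly three comma-separated fields inside CONGRESS{...}.
# Assumption: stage names contain no commas or braces.
FLAG_PATTERN = re.compile(r"CONGRESS\{([^,{}]+),([^,{}]+),([^,{}]+)\}")


def parse_flag(flag):
    """Return the three stage names as a list, or None if the flag is malformed."""
    match = FLAG_PATTERN.fullmatch(flag.strip())
    if match is None:
        return None
    stages = [part.strip() for part in match.groups()]
    # Reject fields that are only whitespace.
    if not all(stages):
        return None
    return stages


def build_flag(stage1, stage2, stage3):
    """Assemble a flag string from three stage names."""
    return "CONGRESS{" + ",".join([stage1, stage2, stage3]) + "}"
```

For instance, parse_flag applied to the example flag returns the three placeholder names in order, while a flag with only two fields returns None.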

Hint

SFT first, then learn what humans like, then optimize for it.