The DeepSeek-R1 Training Trick
Archive, Medium
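DeepSeek-R1's reasoning training used a PPO variant that replaces the learned value function (the critic) with a per-prompt group baseline: each sampled response is scored against the other responses sampled for the same prompt. Dropping the critic saves compute. The answer is a four-letter acronym. Flag format: CONGRESS{acronym}. Example: CONGRESS{rloo}.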
Hint: Group + RL + PO.
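For reference, the group-baseline idea in the challenge text can be sketched in a few lines. This is an illustrative sketch, not DeepSeek's training code: the function name `group_baseline_advantages`, the `eps` stabilizer, and the use of the population standard deviation for scaling are assumptions made here for the example.

```python
from statistics import mean, pstdev


def group_baseline_advantages(rewards, eps=1e-8):
    """Score each response against its own group instead of a learned critic.

    `rewards` holds one scalar reward per response sampled for the SAME prompt.
    Hypothetical helper: the name, `eps`, and the choice of population
    standard deviation are assumptions for illustration only.
    """
    baseline = mean(rewards)   # the group mean stands in for the value function
    scale = pstdev(rewards)    # spread of rewards within the group
    return [(r - baseline) / (scale + eps) for r in rewards]


# Four responses to one prompt: two judged correct (1.0), two incorrect (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_baseline_advantages(rewards)
print(advantages)
```

Responses above the group mean get a positive advantage and those below get a negative one, so no separate value network has to be trained or evaluated. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.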
Archive — no submissions accepted
This challenge is preserved for reference. Play live challenges at /challenges.