Fine-Tuning & Training

The DeepSeek-R1 Training Trick

Archive

Medium

150 pts, 0 solves

DeepSeek-R1's reasoning training used a PPO variant that drops the learned value function (the critic) and instead computes a baseline per prompt from a group of sampled responses, which saves compute. Name the method by its four-letter acronym. Flag format: CONGRESS{acronym}. Example: CONGRESS{rloo}.
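To make the "per-prompt group baseline" concrete, here is a minimal sketch of how advantages could be computed without a value network: sample several responses for one prompt, score each, then center and scale the rewards within that group. The function name, the example rewards, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not taken from any particular implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per sampled response for a single prompt.

    The group mean plays the role of the baseline that a learned value
    function would otherwise provide. Dividing by the group standard
    deviation (population form, an assumption here) normalizes the scale.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards)
    return [(r - baseline) / (scale + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt,
# e.g. 1.0 when the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above their group's mean get a positive advantage and those below get a negative one, so no separate critic has to be trained or held in memory. If every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.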

Hint

Group + RL + PO.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.