Archive
Fine-Tuning & Training

The DeepSeek-R1 Training Trick

Medium
150 pts, 0 solves
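DeepSeek-R1's reasoning training used a PPO variant that drops the learned value function (the critic) and instead uses a baseline computed from a group of responses sampled for the same prompt, which saves compute. The answer is the method's four-letter acronym. Flag format: CONGRESS{acronym}. Example: CONGRESS{rloo}.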
Hint: Group + RL + PO.
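
To make the "per-prompt group baseline" idea concrete, here is a minimal sketch in Python. It assumes each sampled response has already been given a scalar reward. The function name `group_relative_advantages` and the `eps` parameter are hypothetical, chosen for illustration only. Dividing by the group's standard deviation is a common normalization choice, not something the challenge text specifies.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response against the other responses to the same prompt.

    rewards: scalar rewards for several responses sampled from ONE prompt.
    The group mean acts as the baseline, so no value network is needed.
    """
    baseline = mean(rewards)   # per-prompt group baseline
    scale = pstdev(rewards)    # group spread (assumed normalization)
    return [(r - baseline) / (scale + eps) for r in rewards]


# Four sampled answers to one prompt: two rewarded, two not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and the rest get a negative one. These advantages then take the place of the critic-based estimate in a PPO-style clipped objective.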

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.