Searching protocol for "thinking-aware"
Robust RLHF with group-relative policy training.
Lower-variance RL with leave-one-out baselines.
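
The two baseline ideas named above can be illustrated with a minimal sketch. It assumes each prompt has a group of k sampled completions, each with one scalar reward. The function names, the `eps` parameter, and the sample rewards are hypothetical, chosen only for illustration; this is not taken from any particular implementation.

```python
from statistics import mean, pstdev


def leave_one_out_advantages(rewards):
    """Each sample's reward minus the mean reward of the OTHER samples in its group."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean of the remaining k - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


def group_relative_advantages(rewards, eps=1e-8):
    """Each sample's reward standardised by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps (an assumed small constant) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical rewards for 4 completions of one prompt
    print(leave_one_out_advantages(rewards))
    print(group_relative_advantages(rewards))
```

In both cases the baseline comes from sibling samples of the same prompt rather than from a learned value function. The leave-one-out form excludes a sample's own reward from its baseline, whereas the group-relative form includes it in the mean and additionally rescales by the group's spread.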