Robust RLHF with group-relative policy training.
High-performance RL training framework
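The first entry mentions "group-relative policy training" without explaining it. One reading is GRPO-style training, and that reading is an assumption here. In that style, several completions are sampled for each prompt, and each completion's reward is scored against the rest of its group rather than against a learned value function. The minimal sketch below shows only that advantage step. The function name group_relative_advantages and the eps parameter are hypothetical and are not taken from either project.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Hypothetical helper (assumes GRPO-style normalization).

    Scores each completion's reward relative to the other completions
    sampled for the same prompt: subtract the group mean, then divide by
    the group's population standard deviation. eps keeps the division
    finite when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions of one prompt, two judged correct (1.0) and two not (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Here, rewards of [1.0, 0.0, 0.0, 1.0] give advantages of roughly [1, -1, -1, 1]. Because the baseline comes from the group itself, this step needs no separate critic. Some implementations use the sample standard deviation instead of the population one.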