DAPO's core contribution is a reproducible RL recipe for long-CoT reasoning: starting from the Qwen2.5-32B base model, it combines a verl-based GRPO variant, rule-based rewards, the DAPO-Math-17K dataset, Clip-Higher, Dynamic Sampling, Token-Level Policy Gradient Loss, and Overlong Reward Shaping to raise AIME 2024 avg@32 to...
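To make three of the listed components concrete, here is a minimal pure-Python sketch of Clip-Higher (separate lower and upper clipping bounds), Dynamic Sampling (dropping prompt groups whose rewards are all identical), and the token-level loss (averaging over all tokens in the batch rather than per response). It is an illustration under stated assumptions, not the verl implementation: the function names are hypothetical, the default values `eps_low=0.2` and `eps_high=0.28` and the use of the population standard deviation are assumptions, and Overlong Reward Shaping is omitted.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within one prompt's group.

    Assumption: population standard deviation; eps avoids division by zero.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic Sampling filter (sketch): a group whose rewards are all equal
    yields zero advantages and no gradient signal, so it is dropped."""
    return max(rewards) != min(rewards)


def clipped_term(ratio, adv, eps_low, eps_high):
    """PPO-style clipped surrogate with decoupled bounds (Clip-Higher)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)


def token_level_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level policy gradient loss (sketch).

    ratios: one list of per-token importance ratios per response.
    advantages: one scalar advantage per response, shared by its tokens.
    The sum runs over every token and is divided by the total token count,
    so long responses contribute proportionally more than short ones.
    """
    total = 0.0
    n_tokens = 0
    for resp_ratios, adv in zip(ratios, advantages):
        for r in resp_ratios:
            total += clipped_term(r, adv, eps_low, eps_high)
            n_tokens += 1
    return -total / n_tokens
```

In a training loop one would sample a group of responses per prompt, discard groups for which `keep_group` is false, compute `group_advantages` on the rest, and minimize `token_level_loss` over the resulting batch.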
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang