Jun 08, 2026

Why Muon Outperforms Adam: A Curvature Perspective

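This paper offers a local-curvature explanation for why Muon trains faster than Adam: at matched validation loss, Muon and Adam obtain similar first-order gains, and the gap comes mainly from the second-order Hessian curvature penalty. A further decomposition shows that this second-order gap is driven mainly by the lower Normalized Directional Sharpness (NDS) of Muon's update direction, while the explanatory power of step size for the gap is comparatively...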

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, Zhuoran Yang

2606.04662-muon-outperforms-adam-curvature Source Optimizer Theory
Jun 07, 2026

UltraEP: Unleash MoE Training and Inference on Rack Scale Nodes with Near Optimal Load Balancing

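UltraEP's core contribution is to move load balancing for large-scale MoE expert parallelism from "periodic prediction based on historical statistics" to "real-time rebalancing per microbatch and per layer based on exact post-gating load": it exploits the high-bandwidth scale-up fabric of rack-scale nodes to place an entire EP group inside a single rack-level communication domain, then uses a quota-driven planner to jointly decide expert replication...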

Xinming Wei, Chao Jin, Tuo Dai, Yinmin Zhong, Shan Yu, Chengxu Yang, Bingyang Wu, Zili Zhang, Jing Mai, Qianchao Zhu, Zhouyang Li, Yuliang Liu, Guojie Luo

2606.04101-ultraep-rack-scale-moe-load-balancing Source Systems
Jun 07, 2026

Self Trained Verification for Training and Test Time Self Improvement

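This paper proposes Self Trained Verification (STV): the same model first acts as a "verifier teacher with privileged information" while it can see the reference answer; on-policy distillation and verdict RL are then used to train a verifier that needs no reference answer at inference time. This verifier significantly improves the test-time verification-refinement loop, and further, through Verifier i...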

Chen Henry Wu, Aditi Raghunathan

2605.30290-self-trained-verification Source RL Methodology
Jun 07, 2026

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

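This paper defines Training Inference Mismatch (TIM) as the disagreement, in LLM RL, between the logprobs that the training side and the inference side assign to the same token sequence, and uses VeXact to build a zero-mismatch baseline in which the FSDP trainer and the rollout engine are bitwise aligned. Experiments show that TIM, a seemingly tiny token-level numerical difference, can by itself trigger RL training collapse,...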

Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng Wu, Xiao Yu

2605.14220-training-inference-mismatch-llm-rl Source RL Systems Methodology
Jun 07, 2026

DAPO: An Open Source LLM Reinforcement Learning System at Scale

DAPO's core contribution is a reproducible long-CoT reasoning RL recipe: on the Qwen2.5 32B base model, it combines a verl-based GRPO variant, rule-based rewards, the DAPO Math 17K dataset, Clip Higher, Dynamic Sampling, Token level Policy Gradient Loss, and Overlong Reward Shaping to raise AIME 2024 avg@32 to...

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei Ying Ma, Ya Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang

2503.14476-dapo-long-cot-rl-system Source RL Systems Methodology
Jun 07, 2026

DeepSeek R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

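The core conclusion of DeepSeek R1 v2 is that large-scale outcome-based RL on a strong base model can induce behaviors such as long-CoT reasoning, self-reflection, verification, and strategy switching. R1 Zero shows that reasoning capability can be elicited through rule-based verifiable rewards without any SFT, while R1, through cold-start SFT, two-stage RL, rejection samplin...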

DeepSeek AI and 199 other authors. Core contributors listed in v2 source include Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao.

2501.12948-deepseek-r1-rl-reasoning Source RL Safety Methodology
Jun 07, 2026

HybridFlow: A Flexible and Efficient RLHF Framework

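HybridFlow's core contribution is to treat RLHF training as a complex dataflow made up of multiple large-model nodes and to propose a hybrid control architecture: across models, a single controller handles orchestration and data resharding; within each model, a multi-controller design carries out efficient distributed training, inference, and generation. Combined with the 3D HybridEngine and automatic device mapping, on RLHF algorithms such as PPO, ReMax, and Safe RLHF, compared with DeepSpeed Chat, OpenR...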

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu

2409.19256-hybridflow-rlhf-framework Source Systems RL
Jun 07, 2026

Defeating Nondeterminism in LLM Inference

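This article argues that when LLM inference still produces different outputs at temperature=0, the main cause is usually a lack of batch invariance: server load changes the batch size, the prefill/decode split, the KV cache layout, and the attention split strategy, which in turn changes the order of floating-point reductions. The authors demonstrate a path to reproducible inference with batch-invariant RMSNorm, matmul, and attention kernels, and...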

Horace He, in collaboration with others at Thinking Machines Lab

2025-09-10-defeating-nondeterminism-llm-inference Source Systems RL Methodology
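
The reduction-order effect mentioned in the summary above can be illustrated with a minimal sketch in plain Python. This is not the article's kernel code; `sequential_sum` and `chunked_sum` are hypothetical helper names, and the chunk size merely stands in for whatever split strategy a kernel might pick under different batch sizes. It shows only that floating-point addition is not associative, so the same inputs reduced in a different grouping can yield a different result.

```python
def sequential_sum(values):
    """Add values strictly left to right (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce each fixed-size chunk first, then combine the partial sums.

    chunk_size is an illustrative stand-in for a kernel's split strategy.
    """
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Same inputs, two different reduction groupings.
values = [1e17, 1.0, -1e17, 1.0]
one_chunk = chunked_sum(values, chunk_size=4)   # ((1e17 + 1) - 1e17) + 1
two_chunks = chunked_sum(values, chunk_size=2)  # (1e17 + 1) + (-1e17 + 1)

print(one_chunk, two_chunks, one_chunk == two_chunks)
```

Neither grouping recovers the exact sum of 2.0, and the two groupings disagree with each other, which is the kind of discrepancy a batch-invariant kernel avoids by fixing the reduction order regardless of batch size.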
Jun 06, 2026

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

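Vortex is a system for developing and deploying sparse attention for LLM serving. It lets researchers or AI agents express sparse attention algorithms in a small amount of Python-style code and efficiently validate throughput, latency, and accuracy directly inside a real serving system.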

Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen

2606.06453-vortex-sparse-attention-serving Source Systems
Jun 06, 2026

Large Language Models Hack Rewards, and Society

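The paper introduces societal hacking: when social rules are encoded as an optimizable reward structure, RL post-training pushes LLMs to look for gaps between formal compliance and institutional intent. In the authors' SocioHack sandbox, RL models rediscover a large number of real historical loopholes, and existing refusal, self-critique, and training regularization only partially mitigate the phenomenon.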

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

2606.04075-llms-hack-rewards-and-society Source Safety RL
Jun 06, 2026

On Effectiveness and Efficiency of Agentic Tool calling and RL Training

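The paper's core conclusion is that progress in agentic tool calling depends both on "how evaluation is done" and on "how training compute is spent": tool-calling benchmarks such as BFCL are significantly perturbed by random seeds, multi-turn templates, retention of reasoning history, system prompts, and training data format, while conventional GRPO/PPO-style RL training wastes compute on large numbers of zero-variance prompts and on expensive policy updates. The authors use online pre-rollout filtering and variance-aware rollout dow...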

Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai

2606.00135-agentic-tool-calling-rl-training Source RL Systems Methodology
Jun 06, 2026

Transformers are Inherently Succinct

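The paper proves that fixed-precision Transformers are highly succinct at expressing certain languages: there are language families that can be represented by polynomial-size Transformers, whereas equivalent LTL formulas or RNNs require exponential size and equivalent finite automata require doubly exponential size.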

Pascal Bergsträßer, Ryan Cotterell, Anthony W. Lin

2510.19315-transformers-inherently-succinct Source Theory

Topic Routes

A single article can appear under multiple problem areas, but each entry point keeps only the papers directly relevant to that topic.

Methodology

Optimizer

RL

Safety

Systems

Theory