UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

Source

Title: UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing
arXiv: https://arxiv.org/abs/2606.04101
HTML v1: https://arxiv.org/html/2606.04101v1
PDF: https://arxiv.org/pdf/2606.04101
TeX Source: https://arxiv.org/e-print/2606.04101
Code/Project: No official public repository found; the paper states that UltraEP is implemented as a standalone runtime and integrated into Megatron-LM and SGLang.
Authors: Xinming Wei, Chao Jin, Tuo Dai, Yinmin Zhong, Shan Yu, Chengxu Yang, Bingyang Wu, Zili Zhang, Jing Mai, Qianchao Zhu, Zhouyang Li, Yuliang Liu, Guojie Luo
Submitted: 2026-06-02
Current version read: v1, submitted 2026-06-02
PDF pages: 12
DOI: https://doi.org/10.48550/arXiv.2606.04101
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Authors and Affiliations

Xinming Wei: School of Computer Science, Peking University.
Chao Jin: School of Computer Science, Peking University.
Tuo Dai: RedNote.
Yinmin Zhong: School of Computer Science, Peking University.
Shan Yu: Shanghai AI Laboratory.
Chengxu Yang: Independent Researcher.
Bingyang Wu: School of Computer Science, Peking University.
Zili Zhang: School of Computer Science, Peking University.
Jing Mai: School of Computer Science, Peking University.
Qianchao Zhu: Independent Researcher.
Zhouyang Li: Independent Researcher.
Yuliang Liu: Independent.
Guojie Luo: School of Computer Science, Peking University.

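Reading Goals and Scope of Judgment

This note focuses on:

Why UltraEP moves from history-based balancing to exact-load real-time balancing.
How quota-driven planning handles expert replication and token reroute together.
How RSN-native (rack-scale node) communication reduces the cost of hot-path expert-state transfer.
What UltraEP implies for MoE training, serving prefill, and RL rollout/training systems.
How it relates to the locally archived TIM/VeXact, HybridFlow, DeepSeek-R1, and DAPO notes.

Scope of judgment:

The paper is arXiv v1. The experiments run on a public-cloud RSN cluster and an internal production training stack, and the conclusions depend heavily on hardware, network, model, and runtime details.
The paper mainly targets MoE training and serving prefill. The decode phase is memory-bound and therefore less affected by compute-side imbalance, so UltraEP's techniques concentrate on MoE expert-side balancing.
The code is not public; this note relies on the arXiv HTML, PDF, TeX source, and the evaluation described in the paper.
The production RefMoE-288B model and in-house corpus cannot be reproduced externally. The related conclusions should be read as production deployment evidence, and community reproducibility awaits more public material.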

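Paper Outline

1. Background

MoE models grow total parameter count through sparse activation. Expert parallelism (EP) distributes experts across multiple GPUs, and each token is sent via all-to-all to the rank hosting the expert chosen by the router/gating.

As frontier MoE models move toward fine-grained experts (for example, hundreds of smaller experts, top-k routing, EP32/EP64), expert load imbalance is amplified into three system problems:

The expert compute on the rank hosting hot experts becomes a straggler.
Hotspots appear at the senders / receivers of the token all-to-all.
Activation memory peaks on the receiving side, raising OOM risk and activation checkpointing pressure.

Existing systems such as EPLB typically predict hot experts from historical load and periodically adjust expert placement or replicate hot experts. This approach depends on a stable load pattern. The paper's load analysis shows that in prefill, expert popularity shifts quickly across batches, semantic domains, and layers; in training, the router has not stabilized early on, and later the auxiliary loss / routing bias keeps producing oscillation. Historical statistics go stale easily and can even turn load balancing into a new source of stragglers.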

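2. Core Hypothesis and Entry Point

UltraEP's entry point: the post-gating exact load is the only truly reliable load signal. The catch is that the exact load is known only after gating, so planning, expert replication, reroute, and expert-state transfer all land on the forward critical path.

In a traditional RDMA cluster, one EP group spans multiple servers or even multiple racks, and moving expert weights / gradients across nodes is too expensive. Rack-scale nodes change the physical conditions: 64+ GPUs in a rack are interconnected by a high-bandwidth scale-up fabric, so an EP group can be placed inside a single high-bandwidth domain. This makes exact-load hot-path balancing feasible in engineering terms.

UltraEP's system assumptions are therefore:

Cross-rack scaling is left to PP/DP, and the EP group stays intra-rack.
Real-time balancing is done only inside the RSN.
The control plane must produce a high-quality plan quickly, between gating and token dispatch.
The data plane must turn irregular, sparse, frequently changing expert-state migration into low-overhead device-side streaming.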

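3. Method / System / Theoretical Framework

3.1 Expert layout and memory management

UltraEP distinguishes logical experts from physical experts.

logical expert: the expert identity defined by the model.
physical expert: the expert slot actually stored on a given rank.

Each rank keeps main slots and redundant slots:

A main slot permanently holds its original logical expert.
A redundant slot can temporarily hold a replica of a hot expert.
Main expert placement never moves; capacity for hot experts is added only through replication.

The reason for replication-only is that under large EP each rank holds very few local main experts, so reordering main experts yields limited benefit while introducing complex state migration. Replicating hot experts scales the serving capacity of the bottleneck expert more directly.

On the memory side, redundant slots keep no persistent optimizer state and reuse weight / gradient buffers across MoE layers. The paper gives the Qwen3-235B-A22B example: with 94 MoE layers and 128 experts, one redundant slot kept separately per layer would need 3.3 GB of weights and 6.6 GB of gradients; with reuse, each rank needs only 36 MB of weights and 72 MB of gradients. The cost is that replica weights must be materialized layer by layer, in time, on the forward critical path.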
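The footprint figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check, assuming the 36 MB / 72 MB per-slot buffers quoted in the text and 1 GB = 1024 MB (the unit convention is an assumption, not something the note states).

```python
# Back-of-envelope check of the redundant-slot footprint quoted in the text.
MOE_LAYERS = 94            # Qwen3-235B-A22B MoE layers (from the text)
WEIGHT_MB_PER_SLOT = 36    # shared weight buffer for one redundant slot (from the text)
GRAD_MB_PER_SLOT = 72      # shared gradient buffer for one redundant slot (from the text)


def per_layer_footprint_gb(mb_per_slot: float, layers: int) -> float:
    """Footprint in GB if every MoE layer kept its own copy of the slot.

    Assumes 1 GB = 1024 MB.
    """
    return mb_per_slot * layers / 1024


weights_gb = per_layer_footprint_gb(WEIGHT_MB_PER_SLOT, MOE_LAYERS)
grads_gb = per_layer_footprint_gb(GRAD_MB_PER_SLOT, MOE_LAYERS)
print(f"{weights_gb:.1f} GB weights, {grads_gb:.1f} GB gradients")
```

Under these assumptions the per-layer figures come out at roughly 3.3 GB and 6.6 GB, which matches the numbers reported for the non-shared layout.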

3.2 Forward / backward pipeline

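Forward:

Reuse the existing notify-dispatch to collect global routing information.
Once every rank has the same exact load matrix, each rank deterministically solves the same replication / reroute plan on the GPU.
Reroute converts the token-to-logical-expert mapping into a token-to-physical-expert mapping.
Main expert weights are distributed to the remote replicas.
After weight distribution completes, token all-to-all and expert compute run.

Backward:

Reuse the forward balancing metadata without re-solving.
Rematerialize redundant expert weights during backward.
Gradients produced by replicas are reduced back into the main expert's gradient buffer.
The optimizer updates only the main expert, which keeps training semantics equivalent.

The key to this design: replicas change only the physical execution path, not the semantics of model parameter updates.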
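A toy check of that equivalence, as a minimal sketch: for a linear expert y = W x, the weight gradient is a sum of per-token outer products, so splitting tokens between a main instance and a replica and then reducing the replica's gradient back gives the same result as running all tokens on one instance. The helper names `weight_grad` and `reduce_into_main` are hypothetical, and the expert is simplified to a single linear layer.

```python
def weight_grad(tokens, out_grads, d_out, d_in):
    """dW for a linear expert y = W x: sum over tokens of outer(g, x)."""
    dw = [[0.0] * d_in for _ in range(d_out)]
    for x, g in zip(tokens, out_grads):
        for i in range(d_out):
            for j in range(d_in):
                dw[i][j] += g[i] * x[j]
    return dw


def reduce_into_main(main_grad, replica_grad):
    """Accumulate the replica gradient into the main buffer, then clear the replica."""
    for i, row in enumerate(replica_grad):
        for j, value in enumerate(row):
            main_grad[i][j] += value
            row[j] = 0.0  # replica buffer is reused by the next layer
    return main_grad


# Four tokens routed to one logical expert (values chosen to be exact in binary).
tokens = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0], [2.0, 2.0]]
out_grads = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [-1.0, 0.5]]

# Reference: every token runs on the main expert.
full_grad = weight_grad(tokens, out_grads, 2, 2)

# Balanced execution: two tokens stay on the main instance, two go to a replica.
main_grad = weight_grad(tokens[:2], out_grads[:2], 2, 2)
replica_grad = weight_grad(tokens[2:], out_grads[2:], 2, 2)
main_grad = reduce_into_main(main_grad, replica_grad)

print(main_grad == full_grad)
```

With real floating-point kernels the summation order changes, so bitwise equality is not guaranteed in general; the toy values here are picked so that the sums are exact.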

3.3 Optimization formulation

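Inputs:

EP group ranks $R$.
Logical experts $E$.
Main expert home rank $h(e)$.
Number of redundant slots per rank $N_{\mathrm{slot}}$.
Runtime load matrix $\Lambda=\{\lambda_{r,e}\}$, the token load that source rank $r$ sends to logical expert $e$.

Outputs:

Quota table $U=\{u_{e,t}\}$: how many tokens the physical instance of expert $e$ on host rank $t$ ends up serving.
Reroute split $Q=\{q_{r,e,t}\}$: how the tokens that source rank $r$ sends to expert $e$ are divided across host ranks $t$.

The forward latency objective has five components, with reroute and weight distribution overlapped under the max:

$$T_{\mathrm{solve\_rep}}^{\mathrm{fwd}} + \max\left(T_{\mathrm{reroute}}^{\mathrm{fwd}}, T_{\mathrm{w\_distr}}^{\mathrm{fwd}}\right) + T_{\mathrm{tok\_a2a}}^{\mathrm{fwd}} + T_{\mathrm{moe}}^{\mathrm{fwd}}$$

MoE compute tracks the post-reroute load of the busiest rank:

$$T_{\mathrm{moe}}\propto \max_r \sum_e u_{e,r}$$

Token all-to-all is bounded by both the busiest sender and the busiest receiver:

$$T_{\mathrm{tok\_a2a}}\propto \max_r \max\left(\sum_e \lambda_{r,e}, \sum_e u_{e,r}\right)$$

Weight distribution is dominated by the rank whose hot main experts have the largest total fan-out:

$$T_{\mathrm{w\_distr}}\propto \max_r \sum_{e\in E_r} (|H(e)|-1)$$

Here $H(e)$ is the set of ranks hosting an instance of expert $e$, and $E_r$ is read in this note as the set of main experts homed on rank $r$.

Constraints: main placement is fixed; each rank has a limited number of redundant slots; the same logical expert cannot appear more than once on a rank; every new replica must carry at least $u_{\min}$ tokens; and the backward communication introduced by replicas should be hidden by compute overlap as far as possible.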
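To make the three proportional terms concrete, the sketch below evaluates them for a tiny two-rank example. It is an illustration only: `cost_terms` is a hypothetical helper, the values are unit-free token and replica counts rather than latencies, and a rank is treated as hosting expert $e$ when it is the home rank or carries a nonzero quota.

```python
def cost_terms(load, quota, home):
    """Evaluate the proportional cost terms of the formulation.

    load[r][e]  = lambda_{r,e}, tokens source rank r sends to logical expert e
    quota[e][t] = u_{e,t}, tokens served by the instance of expert e on rank t
    home[e]     = h(e), home rank of the main copy of expert e
    """
    n_ranks, n_experts = len(load), len(home)
    send = [sum(load[r]) for r in range(n_ranks)]
    recv = [sum(quota[e][t] for e in range(n_experts)) for t in range(n_ranks)]

    moe = max(recv)                                   # busiest rank after reroute
    tok_a2a = max(max(s, v) for s, v in zip(send, recv))

    fanout = [0] * n_ranks                            # per home rank: sum of |H(e)| - 1
    for e in range(n_experts):
        hosts = sum(1 for t in range(n_ranks) if t == home[e] or quota[e][t] > 0)
        fanout[home[e]] += hosts - 1
    return {"moe": moe, "tok_a2a": tok_a2a, "w_distr": max(fanout)}


load = [[6, 1], [4, 1]]          # expert 0 is hot: 10 tokens vs 2 for expert 1
home = [0, 1]

no_balancing = cost_terms(load, [[10, 0], [0, 2]], home)
with_replica = cost_terms(load, [[6, 4], [0, 2]], home)
print(no_balancing)   # {'moe': 10, 'tok_a2a': 10, 'w_distr': 0}
print(with_replica)   # {'moe': 6, 'tok_a2a': 7, 'w_distr': 1}
```

The replica lowers the compute and all-to-all terms at the price of one unit of weight-distribution fan-out, which is the tradeoff the objective is balancing.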

3.4 Quota-driven planner

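UltraEP's planner solves directly for the final per-instance quota, avoiding the error introduced by the two-stage approach of "replicate experts first, then reroute with round-robin or a heuristic".

Core steps:

Compute each expert's total load $\lambda_e$.
Compute each rank's initial load $\ell_r$.
Binary search between the average load and the max load for the smallest feasible threshold $\tau$.
For a given $\tau$, define each rank's excess and slack:

$$\mathrm{exc}_r(\tau)=\max(\ell_r-\tau,0)$$

$$\mathrm{slk}_r(\tau)=\max(\tau-\ell_r,0)$$

A greedy feasibility oracle visits overloaded ranks and their hot main experts, moving expert load to the rank with the largest slack that satisfies the slot, no-duplicate, and $u_{\min}$ constraints.
If all excess is drained, the current $\tau$ is feasible and the search continues with a smaller threshold.
Once quota $U$ is obtained, materialize the slot assignment $X$.

In essence, this planner ties the creation of a replica to the token quota it actually carries. A replica is created only when it can absorb enough load, which cuts useless replicas, replication traffic, and wasted slots.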
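The steps above can be sketched on the CPU as a binary search wrapped around a greedy oracle. This is a simplified illustration, not the paper's GPU solver: `feasible` and `plan` are hypothetical names, loads are integers, `u_min >= 1` is assumed so that a nonzero quota marks an existing replica, and the binary search treats the greedy oracle as if it were monotone in $\tau$.

```python
def feasible(tau, expert_load, home, n_ranks, n_slots, u_min):
    """Greedy oracle: try to drain every rank's excess above tau.

    Returns the quota table quota[e][t], or None if some excess remains.
    """
    n_experts = len(expert_load)
    quota = [[0] * n_ranks for _ in range(n_experts)]
    rank_load = [0] * n_ranks
    for e, lam in enumerate(expert_load):
        quota[e][home[e]] = lam
        rank_load[home[e]] += lam
    slots_used = [0] * n_ranks

    for r in sorted(range(n_ranks), key=lambda x: -rank_load[x]):
        mains = sorted((e for e in range(n_experts) if home[e] == r),
                       key=lambda e: -expert_load[e])      # hottest main expert first
        for e in mains:
            while rank_load[r] > tau:
                cands = [t for t in range(n_ranks)
                         if t != r and quota[e][t] == 0     # no duplicate instance
                         and slots_used[t] < n_slots        # free redundant slot
                         and tau - rank_load[t] >= u_min]   # enough slack
                if not cands:
                    break
                t = max(cands, key=lambda x: tau - rank_load[x])
                move = min(max(rank_load[r] - tau, u_min),
                           tau - rank_load[t], quota[e][r])
                if move < u_min:
                    break
                quota[e][t] = move
                quota[e][r] -= move
                rank_load[r] -= move
                rank_load[t] += move
                slots_used[t] += 1
        if rank_load[r] > tau:
            return None
    return quota


def plan(load, home, n_slots, u_min=1):
    """Binary search the smallest threshold tau that the greedy oracle accepts."""
    n_ranks, n_experts = len(load), len(home)
    expert_load = [sum(load[r][e] for r in range(n_ranks)) for e in range(n_experts)]
    rank_load = [0] * n_ranks
    for e in range(n_experts):
        rank_load[home[e]] += expert_load[e]
    lo = -(-sum(expert_load) // n_ranks)    # ceil of the average load
    hi = max(rank_load)                     # always feasible: nothing to drain
    best = feasible(hi, expert_load, home, n_ranks, n_slots, u_min)
    while lo < hi:
        mid = (lo + hi) // 2
        quota = feasible(mid, expert_load, home, n_ranks, n_slots, u_min)
        if quota is not None:
            best, hi = quota, mid
        else:
            lo = mid + 1
    return hi, best


load = [[5, 1, 1, 1] for _ in range(4)]     # every source sends 5 tokens to expert 0
home = [0, 1, 2, 3]
tau, quota = plan(load, home, n_slots=1)
print(tau, quota[0])                        # 8 [8, 4, 4, 4]
```

In this example the hot expert ends up with three replicas of quota 4 each, and every rank serves exactly 8 tokens. With `n_slots=0` the same call returns the unbalanced threshold 20, since no replica can be created.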

3.5 Reroute with locality

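Once quotas are fixed, reroute only has to decompose source-wise demand onto physical instances.

UltraEP first lets each source rank consume the quota hosted on its own rank, which reduces cross-rank traffic. The remaining demand is allocated in proportion to residual quota, with deterministic rounding that guarantees:

$$\sum_t q_{r,e,t}=\lambda_{r,e}$$

$$\sum_r q_{r,e,t}=u_{e,t}$$

Token assignment is implemented with cumulative quotas and a prefix scan. At dispatch time, the $j$-th token of a (source rank, logical expert) pair finds its target physical expert through a lightweight upper-bound lookup.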
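A minimal sketch of the reroute for a single expert follows. It keeps the local-first step and the prefix-sum plus upper-bound lookup, but it swaps the proportional residual split for a simpler interval-overlap split (sources and residual quotas are laid out on one cumulative axis), which still satisfies both marginal constraints exactly. The function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def reroute_one_expert(demand, quota):
    """Split demand[r] (lambda_{r,e}) across hosts with quota[t] (u_{e,t}).

    Returns q[r][t] with row sums equal to demand and column sums equal to quota.
    """
    n = len(demand)
    assert sum(demand) == sum(quota)
    q = [[0] * n for _ in range(n)]
    rem_d, rem_q = list(demand), list(quota)

    # 1) Locality: a source first consumes the quota hosted on its own rank.
    for r in range(n):
        local = min(rem_d[r], rem_q[r])
        q[r][r] = local
        rem_d[r] -= local
        rem_q[r] -= local

    # 2) Residual: overlap the prefix-sum intervals of demand and quota.
    d_end = list(accumulate(rem_d))
    q_end = list(accumulate(rem_q))
    for r in range(n):
        pos = d_end[r] - rem_d[r]
        while pos < d_end[r]:
            t = bisect_right(q_end, pos)   # first host whose cumulative quota exceeds pos
            take = min(d_end[r], q_end[t]) - pos
            q[r][t] += take
            pos += take
    return q


def target_for_token(q_row, j):
    """Upper-bound lookup: host rank of the j-th token of one (source, expert) pair."""
    return bisect_right(list(accumulate(q_row)), j)


demand = [5, 5, 5, 5]        # each source rank sends 5 tokens to the hot expert
quota = [8, 4, 4, 4]         # main instance on rank 0 plus three replicas
q = reroute_one_expert(demand, quota)
print(q)                     # [[5, 0, 0, 0], [1, 4, 0, 0], [1, 0, 4, 0], [1, 0, 0, 4]]
print(target_for_token(q[1], 0), target_for_token(q[1], 1))   # 0 1
```

In the example only 3 of the 20 tokens leave their source rank, because each source first fills the instance on its own rank.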

3.6 GPU-native solving

UltraEP runs quota solving on the GPU to avoid CPU synchronization and host-device metadata transfer. The implementation uses one cooperative thread block on a single SM and keeps the load matrix and placement state in shared memory; different warps evaluate threshold probes and use warp-level reductions to find target ranks that satisfy the slack, slot, and no-duplicate constraints.

This compresses what looks like a CPU-side combinatorial search into a GPU-resident feasibility problem that directly outputs the slot mapping, quotas, and reroute metadata.

3.7 RSN-native balancing communication

UltraEP's expert replication forms a sparse transfer graph that changes with every microbatch / layer, so its communication pattern is more dynamic than a regular collective. It handles two kinds of data movement:

forward weight distribution / backward weight redistribution: main expert to remote replicas.
backward gradient reduction: replica gradients back to the main expert.

Persistent Tile Streaming:

Expert weights / gradients are cut into fixed-size tiles.
The placement plan is compiled into device-resident transfer tasks.
Thread blocks of a persistent kernel pull tiles from a global task stream.
In weight distribution, a source tile is staged into shared memory only once and then written to multiple remote destinations.
In gradient reduction, gradient tiles are read from remote replicas and accumulated into the main expert's gradient buffer, and the replica buffer is then cleared for reuse.
Double buffering hides task lookup, address translation, and synchronization inside the tile pipeline.
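The "plan compiled into tile tasks" idea can be illustrated with a host-side sketch. Everything here is a stand-in: `TileTask`, `compile_weight_tasks`, and `run_tasks` are hypothetical names, byte arrays play the role of device buffers, and a plain loop replaces the persistent kernel. The one property it preserves is that each source tile is read once and then written to every destination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileTask:
    expert: int
    offset: int        # byte offset of the tile inside the expert buffer
    size: int          # tile size in bytes (the last tile may be shorter)
    src_rank: int
    dst_ranks: tuple   # every replica host receives the same staged tile


def compile_weight_tasks(placement, expert_bytes, tile_bytes):
    """placement: {expert: (home_rank, [replica_ranks])} -> flat list of tile tasks."""
    tasks = []
    for expert, (home, replicas) in sorted(placement.items()):
        if not replicas:
            continue
        for offset in range(0, expert_bytes, tile_bytes):
            size = min(tile_bytes, expert_bytes - offset)
            tasks.append(TileTask(expert, offset, size, home, tuple(replicas)))
    return tasks


def run_tasks(tasks, buffers):
    """Stand-in for the persistent kernel: stage a tile once, write it to all destinations."""
    for task in tasks:
        src = buffers[task.src_rank][task.expert]
        staged = src[task.offset:task.offset + task.size]      # single read of the tile
        for dst in task.dst_ranks:
            dst_buf = buffers[dst].setdefault(task.expert, bytearray(len(src)))
            dst_buf[task.offset:task.offset + task.size] = staged


placement = {0: (0, [1, 2])}                    # expert 0 lives on rank 0, replicated to 1 and 2
buffers = {0: {0: bytearray(range(10))}, 1: {}, 2: {}}
tasks = compile_weight_tasks(placement, expert_bytes=10, tile_bytes=4)
run_tasks(tasks, buffers)
print([t.size for t in tasks])                  # [4, 4, 2]
```

On a real device the task list would live in GPU memory and thread blocks would pull entries from it; the sketch only shows the task layout.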

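Overlap-aware footprint:

On the forward critical path, more resident thread blocks are used to raise in-flight transfer.
On the backward overlap path, SM residency is limited to avoid interfering with compute kernels such as Wgrad/Dgrad.

Chunk Streaming Relay:

Hot expert fan-out can turn the source rank into a bottleneck.
When an expert's replica count exceeds a threshold, UltraEP builds a two-stage relay tree.
The source first sends chunks to a small number of relay ranks, and each relay forwards a chunk to its leaves as soon as it arrives.
The relay frontier size is close to $\sqrt{|H(e)|-1}$, which balances the fan-out of the two stages.
Relay selection picks lightly loaded ranks by their current sending volume, so hot expert fan-out is spread onto ranks with spare bandwidth.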
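A small sketch of the two-stage tree construction, under stated assumptions: relays are chosen among the replica hosts themselves, the frontier size is the ceiling of the square root of the replica count, leaves are dealt round-robin to relays, and the `threshold` default is an arbitrary illustrative value. `build_relay_tree` is a hypothetical helper.

```python
import math


def build_relay_tree(source, replicas, send_volume, threshold=4):
    """Return {sender: [receivers]} for distributing one expert's weights.

    replicas    : ranks that must receive the weights (|H(e)| - 1 of them)
    send_volume : current outgoing volume per rank, used to pick idle relays
    threshold   : below or at this replica count the source sends directly (assumed value)
    """
    n = len(replicas)
    if n <= threshold:
        return {source: list(replicas)}

    k = math.isqrt(n)
    if k * k < n:
        k += 1                                     # ceil(sqrt(n)) relays

    ordered = sorted(replicas, key=lambda r: (send_volume[r], r))
    relays, leaves = ordered[:k], ordered[k:]      # least-loaded ranks become relays

    tree = {source: relays}
    for relay in relays:
        tree[relay] = []
    for i, leaf in enumerate(leaves):
        tree[relays[i % k]].append(leaf)           # round-robin leaves across relays
    return tree


send_volume = {1: 50, 2: 10, 3: 40, 4: 5, 5: 60, 6: 20, 7: 70, 8: 30, 9: 80}
tree = build_relay_tree(source=0, replicas=list(range(1, 10)), send_volume=send_volume)
print(tree)   # {0: [4, 2, 6], 4: [8, 5], 2: [3, 7], 6: [1, 9]}
```

With nine replicas the source's fan-out drops from 9 to 3, and no rank in the tree sends to more than 3 peers.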

3.8 Implementation

UltraEP is a standalone runtime; the core library is about 9.6K lines of C++, device kernels, and Python. In the paper:

Training integrates with Megatron-LM, with under 1K lines of additional code.
Serving integrates with SGLang, with under 1K lines of additional code.
Token dispatch/combine uses the DeepEP hybrid-ep branch, version v1.2.1+7febc6e.
It uses GPU-initialized one-sided peer-memory access, allocates symmetric buffers at initialization, and converts peer handles into device-resident address tables.
Redundant experts are kept out of the framework's parameter/gradient buckets, optimizer state, and checkpoints.
Backward tracks the forward balancing plan through a virtual layer ID, which supports in-flight microbatches under PP and virtual PP.

4. Chain of Conclusions

The paper's chain of evidence is:

Large EP plus fine-grained MoE amplifies expert load imbalance, affecting compute, token all-to-all, and activation memory.
Expert load in both prefill and training is highly dynamic, so history-based EPLB can go stale.
RSN lets the EP group stay inside a high-bandwidth scale-up fabric, which makes exact-load hot-path balancing feasible.
Quota-driven planning directly optimizes post-reroute rank load, reducing useless replicas and replication traffic.
GPU-native solving, persistent tile streaming, and relay fan-out mitigation bring planning/replication overhead down to an acceptable range.
End-to-end experiments show UltraEP close to the force-balanced ideal, and it sustains throughput and convergence in production 2560-GPU MoE training.

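Key Experiments / Theorems

Result 1: load analysis shows expert load is non-stationary

Setup: serving prefill uses Qwen3-235B, top-8 activated out of 128 experts, EP64; training uses GLM4.5-106B-A12B and DeepSeek-V3, EP64.
Metrics: per-expert load distribution, inter-rank imbalance, effect before and after EPLB.
Result: prefill load changes quickly across forward steps, data domains, and layers; training load is unstable early and later is still affected by auxiliary loss / routing bias and microbatch jitter; EPLB driven by historical statistics still leaves significant residual imbalance and in some cases worsens spikes.
Interpretation: the exact post-gating load is a more reliable input for system balancing; periodic history-based prediction is fragile on large-EP fine-grained MoE.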

Result 2: end-to-end training throughput is close to ideal

Setup: public-cloud RSN cluster; training models include GLM4.5-106B-A12B, Qwen3-235B-A22B, and DeepSeek-V3-671B-A37B; 128 or 256 GPUs; resumed from late-stage checkpoints and run for 20 global batches.
Metrics: TFLOPS/GPU, rank imbalance, throughput relative to Megatron-LM.
Result: UltraEP reaches 94.6% of training ideal performance on average; post-balancing imbalance stays at 1.01-1.03; EPLB, LPLB, EPLB+, and UltraEP improve on Megatron-LM by 20%, 12%, 29%, and 42% respectively.
Interpretation: UltraEP's real-time exact balancing reduces stragglers across different MoE architectures, and its overhead does not cancel out the gain.

Result 3: the gain is larger for serving prefill

Setup: Qwen3-235B and GLM4.7-358B; one rack; EP64 or EP40; chunked prefill size 4K; STEM and Mixed realistic reasoning workloads; Poisson arrival process.
Metrics: RPS-TTFT tradeoff, prefill throughput, inter-rank imbalance.
Result: UltraEP reaches 90%-97% of ideal in prefill; on average 1.56x the throughput of SGLang and 1.29x that of EPLB; still a 5%-24% gain over EPLB+; realized imbalance is 1.01-1.04.
Interpretation: in serving prefill, semantic domain, prompt length, batch composition, and arrival pattern are more dynamic, so real-time exact balancing pays off more.

Result 4: latency breakdown shows small hot-path overhead

Setup: Qwen3-235B-A22B training, breaking down the forward / backward latency of each MoE layer.
Metrics: attention, MoE communication, MoE compute, extra latency.
Result: relative to ideal, UltraEP's non-MoE extra latency is 0.33 ms in forward and negligible in backward, 1.8% of total latency; MoE compute is close to ideal; token all-to-all in forward/backward is 33% and 10% higher than ideal respectively, because real routing is uneven.
Interpretation: the remaining gap comes mainly from the irregular token traffic of real routing; rank-level imbalance and UltraEP's own hot-path overhead have been pushed to a low level.

Result 5: activation memory peak is close to ideal

Setup: training and prefill scenarios, comparing the MoE-related activation peak.
Metrics: peak GPU memory, MoE receive-token activations.
Result: without balancing, the MoE activation peak is 2x ideal in training and 11x ideal in serving; once UltraEP flattens the receiving hotspots, the peak is close to ideal.
Interpretation: expert load balancing does more than raise throughput; it also reduces activation memory spikes, OOM risk, and the need for activation checkpointing.

Result 6: the quota solver beats the EPLB+ planner

Setup: exact load and UltraEP communication are kept the same; only the planner is swapped for standard EPLB + round-robin reroute, which forms EPLB+.
Metrics: result imbalance, solving time, redundant slots, max replica fan-out, in-flight token ratio.
Result: average result imbalance drops from 1.19 with EPLB+ to 1.03 with UltraEP; solving time from 0.153 ms to 0.111 ms; $\sum_e |H(e)|$ from 107 to 45; max fan-out from 8.5 to 6.8; in-flight token ratio from 99.9% to 96.0%.
Interpretation: optimizing post-reroute quota directly is closer to the real objective than "replicating experts by pre-reroute hotness", and it also saves slots and communication.

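Result 7: RSN-native communication speeds up expert replication

Setup: Qwen3-235B, EP64, varying imbalance levels, comparing PyTorch distributed batch send/recv, DeepEP, a no-relay ablation, and UltraEP.
Metrics: expert-weight distribution latency.
Result: UltraEP is 3.1x-5.5x faster than torch.distributed and DeepEP; under high fan-out, relay adds another 1.3x-1.8x; as fan-out grows, UltraEP latency stays at about 0.28 ms while no-relay grows roughly linearly.
Interpretation: expert replication traffic has a different communication pattern from ordinary token all-to-all, and on RSN tile streaming and fan-out need dedicated handling.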

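Result 8: 2560-GPU production MoE training

Setup: RefMoE-288B-A16B, 256 experts top-8, EP32-PP4-DP20, 2560 GPUs / 40 racks, 15T tokens, internal training stack.
Metrics: throughput, loss curve, relative ideal.
Result: UltraEP stably reaches over 92% of force-balanced throughput; an average 9.6% improvement over no-balancing; the loss curve follows the expected pretraining trajectory.
Interpretation: UltraEP changes only the physical execution logic and preserves training semantics, demonstrating scalability in production training at the thousand-GPU scale.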

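Assessment of Evidence Strength

Strong evidence

The paper covers both training and serving prefill workloads, with model sizes from 106B to 671B, including Qwen3, DeepSeek-V3, GLM, and the internal RefMoE.
The evaluation goes beyond throughput and breaks down latency, activation memory, solver ablation, communication ablation, and production training convergence.
The EPLB+ comparison is valuable: it isolates the contribution of the exact-load quota solver over standard EPLB + round-robin reroute.
The production 2560-GPU / 15T-token result indicates the system has been stress-tested in real large-scale training.

Medium-strength evidence

The code is not public; the core runtime, GPU kernels, RSN peer-memory implementation, and production stack await reproduction material.
The specific hardware and network implementation of the public-cloud RSN cluster strongly affect the results; the paper gives key bandwidth and GPU memory figures, but the environment cannot be fully rebuilt.
Serving is evaluated mainly on prefill; the gains under decode, mixed prefill-decode scheduling, and PD disaggregation need additional validation.
The routing characteristics of Qwen3 / GLM / DeepSeek-V3 are representative, but MoE architecture, top-k, expert granularity, and router regularization all change the shape of the imbalance.

Inferences that call for caution

UltraEP's near-ideal results depend on the EP group fitting inside the RSN scale-up domain; on an ordinary cross-node RDMA cluster, hot-path expert replication may cost too much.
Replication-only is reasonable on large-EP fine-grained MoE; on architectures with more local main experts, larger experts, or a tighter slot budget, the tradeoff between reordering and replication may shift.
Exact-load balancing changes the physical execution path and the token-to-physical-expert mapping; its effect on numerical determinism, the MoE kernel path, and RL sampler/trainer consistency needs to be re-verified along the lines of TIM/VeXact.
The production RefMoE result demonstrates deployment feasibility, but the model, data, and internal stack are not public, so it cannot serve directly as a community reproduction benchmark.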

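OpenReview / Review Feedback

Venue status: this archive records no public peer-review status.
Public reviews: this archive records no reliably matched OpenReview / ARR / conference reviewer comments.
Ratings / confidence: no public ratings are available for calibration.
Reviewer consensus: none yet.
Main criticisms: no public reviewer objections are available to cite; credibility rests mainly on the paper, technical reports, project evidence, and local consistency checks.
Author response: no public rebuttal on record.
Impact on this paper's credibility: treated as not yet having gone through public review; conclusions need to be calibrated against the experimental setup, baseline strength, reproduction evidence, and cross-paper consistency.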

Local Discussion Notes

1. Points of convergence

UltraEP is the first paper in the local archive devoted to rack-scale nodes and large-EP MoE load balancing. It fills in the large-MoE system cost question that is implicit in the DeepSeek-R1 / DAPO / TIM / tool-calling RL papers.
The core of the paper is putting post-gating exact load, replica quota, token reroute, and expert-state movement into a single hot-path co-design, which goes beyond optimizing a single all-to-all kernel.
It unifies MoE serving and MoE training under one physical problem: the realized expert load of each microbatch at each layer determines rank-level stragglers, activation memory, and communication skew.

2. Revised understanding

Algorithm-side load balancing loss and system-side real-time balancing have different goals. The auxiliary routing loss makes expert utilization healthier over the training distribution, while UltraEP deals with the realized imbalance that has already occurred in an actual microbatch.
Serving prefill is UltraEP's main serving scenario because prefill is compute-bound and affects TTFT; decode is more memory-bound, and compute-side expert imbalance is usually diluted by memory access latency.
The relationship between UltraEP and TIM/VeXact needs care: UltraEP pursues balanced physical execution, while TIM/VeXact pursues rollout/trainer logprob exactness. Future MoE RL systems may need to satisfy both load balancing and deterministic/batch-invariant constraints.

3. Metrics for follow-up verification

load dynamics: per-layer / per-microbatch expert load entropy, max/mean per-expert load, rank-level imbalance p50/p95/p99.
planner: solving latency, threshold gap to ideal, redundant slot utilization, created replicas with low quota, reroute locality ratio.
communication: weight distribution latency, gradient reduction latency, fan-out degree distribution, relay activation ratio, RSN bandwidth utilization.
memory: receive-token activation peak, redundant buffer footprint, OOM rate, activation checkpointing overhead.
training equivalence: replica gradient reduction correctness, optimizer-state isolation, loss curve vs no-balancing baseline.
RL-system related: rollout/trainer logprob mismatch under balancing, MoE expert routing flip rate, batch-invariant kernel compatibility, sampling reproducibility.
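For the load-dynamics metrics in this list, a small helper set like the following could be a starting point. It is a sketch that assumes rank-level imbalance means max load divided by mean load, measures entropy in bits, and uses a nearest-rank percentile; none of these definitions are confirmed by the paper, and the function names are hypothetical.

```python
import math


def rank_imbalance(rank_loads):
    """Max rank load divided by mean rank load (1.0 means perfectly balanced)."""
    mean = sum(rank_loads) / len(rank_loads)
    return max(rank_loads) / mean


def load_entropy_bits(expert_loads):
    """Shannon entropy of the expert load distribution, in bits."""
    total = sum(expert_loads)
    return -sum((x / total) * math.log2(x / total) for x in expert_loads if x > 0)


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]


# One imbalance sample per (layer, microbatch); the values are made up for illustration.
samples = [1.0, 1.1, 1.2, 1.3, 4.0]
print(rank_imbalance([20, 4, 4, 4]))        # 2.5
print(load_entropy_bits([4, 4, 4, 4]))      # 2.0
print(percentile(samples, 50), percentile(samples, 99))   # 1.2 4.0
```

Tracking the p99 alongside the median matters here because a single spiky microbatch is enough to create a straggler.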

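4. X thread version

One thing that makes UltraEP worth reading:

It moves MoE load balancing from "guessing hot experts in advance" to handling the real token load on the spot, after gating.

Under large EP, a small router skew is amplified into rank stragglers, all-to-all hotspots, and activation memory spikes.

Historical statistics struggle to keep up with changes across batches, layers, and domains. Once a placement goes stale, the balancer itself creates tail latency.

UltraEP's approach is systematic: once it has the exact load, a quota planner decides expert replicas and token splits, and tokens are then rerouted to physical experts.

The hard part is the critical path.

The plan has to be fast, and weights / gradients have to be moved on the spot too. The rack-scale scale-up fabric of RSN finally gives this engineering headroom.

The numbers are solid too: 106B-671B MoE, 94.3% ideal throughput, rank imbalance pushed from 1.30-4.01 down to 1.01-1.04.

The most interesting part of this paper is here: MoE system optimization is moving into the runtime load realization layer.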

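Key Takeaways

The performance bottleneck of large-scale MoE has expanded from GEMM / attention kernels alone to runtime load realization: the actual token distribution produced by the router directly determines each rank's compute, communication, and activation memory.
History-based predictive balancing goes stale easily under fine-grained MoE and real traffic. Post-gating exact load is a stronger signal, but it needs a high-bandwidth physical topology such as RSN to support it.
Quota is a good system abstraction: it describes both "which expert gets replicated" and "how many tokens each replica actually carries", which reduces useless replication and the residual imbalance after reroute.
For training systems, physical replication must be isolated from mathematical semantics: replicas hold no optimizer state, and gradients are reduced back to the main expert before the update. This is the precondition for a system optimization to enter the training loop.
For future RL training / rollout systems, MoE load balancing, serving throughput, sampler determinism, and trainer consistency constrain one another and need joint design.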

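Limitations

UltraEP depends on an RSN-class scale-up fabric; on an ordinary multi-node RDMA cluster, the cost of hot-path expert-state transfer may not be amortizable.
The code and runtime are not public, so the community cannot readily reproduce the GPU-native solver, persistent tile streaming, or relay kernels.
Serving focuses on prefill; the full end-to-end gain under decode, PD disaggregation, continuous batching, and mixed workloads still needs broader evaluation.
The production RefMoE-288B model, in-house corpus, and internal training stack are not public, so the 2560-GPU result can only serve as deployment evidence.
The effect of exact-load real-time balancing on numerical determinism is not discussed in depth; MoE RL scenarios need further validation of rollout/trainer consistency.
The replication-only choice suits the large-EP / fine-grained-expert setting of the paper; other MoE topologies may need a fresh comparison of replication, reordering, prefetch, and routing-side regularization.
Relay and tile streaming depend on RSN peer-memory semantics and the specific GPU fabric; porting to other vendors' hardware and network protocols requires additional engineering.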

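Cross-Paper Relations

Author relation to 2605.14220: no author overlap found. Strong methodological relation. TIM/VeXact focuses on rollout/trainer logprob mismatch and batch-invariant kernels, while UltraEP focuses on MoE expert balancing and routing-driven physical execution. UltraEP may change the token-to-physical-expert execution path, so future MoE RL work needs to validate balancing and zero-mismatch jointly.
Author relation to 2501.12948: no author overlap found. Strong topical relation. DeepSeek-R1/R1-Zero uses the DeepSeek-V3 MoE and large-scale rollout/training; UltraEP directly evaluates DeepSeek-V3-671B-A37B and cites the DeepSeek system ecosystem, including DeepSeek-V3/R1, EPLB, DeepEP, and DeepGEMM.
Author relation to 2503.14476: no author overlap found. Medium-to-strong topical relation. DAPO's long-CoT RL training amplifies MoE and long-sequence system costs; UltraEP's serving workload uses DAPO-Math-17K as part of the STEM workload and offers a load-balancing systems view of MoE training/prefill.
Author relation to 2409.19256: no author overlap found. Medium systems relation. HybridFlow/VERL focuses on multi-model dataflow and training/inference scheduling for RLHF/RLVR, while UltraEP focuses on real-time load balancing inside the MoE EP group; future large-MoE RLHF/RLVR systems could use UltraEP as a low-level optimization in the actor rollout / training backend.
Author relation to 2606.00135: no author overlap found. Medium topical relation. The tool-calling RL paper focuses on rollout waste and policy update cost, while UltraEP focuses on expert-side imbalance in MoE prefill/training; both show that agent/RL system efficiency depends on workload structure and not only on model FLOPs.
Author relation to 2025-09-10: no author overlap found. Medium relation. The Thinking Machines article focuses on batch-invariant inference and reproducible serving; UltraEP dynamically reroutes physical experts in MoE, so follow-up work needs to establish whether this balancing is compatible with deterministic inference / sampler consistency.
The relation to 2605.30290, 2606.04075, 2605.31514, and 2510.19315 is weak, connected mainly through Qwen3, LLM systems, optimization loops, or Transformer/MoE background.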

Reference Intake Brief

Target

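Intended target system: add a standalone UltraEP paper note; update the index row and the MoE systems / RSN relations sections.
Existing related assets: content/utility/papers-index.md, 2605.14220-training-inference-mismatch-llm-rl.md, 2501.12948-deepseek-r1-rl-reasoning.md, 2503.14476-dapo-long-cot-rl-system.md, 2409.19256-hybridflow-rlhf-framework.md.
Proposed form: create a new standalone Markdown document and update the index.

Reusable Elements

The problem definition of exact-load real-time balancing.
The quota-driven expert replication + token reroute abstraction.
RSN-native persistent tile streaming and chunk streaming relay.
Evaluation metrics for MoE training / prefill serving: throughput, TTFT, rank imbalance, activation memory peak, solver latency, communication latency.
Cross-system relations with MoE RL systems and TIM/VeXact.

Risks

Copyright/over-copying: this note uses paraphrase and short formula summaries and does not copy long passages of the paper.
Unsourced or unverifiable claims: the code and the production stack are not public; the related conclusions are marked with their limits.
Tone/brand mismatch: the wording follows this directory's guidelines and avoids contrastive-negation sentences.
Safety/compliance issues: the paper is systems performance research with no direct misuse workflow.
Overlap with existing assets: there is topical overlap with HybridFlow and TIM/VeXact, but this note focuses on MoE EP load balancing and RSN, so it suits standalone archiving.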

Skipped

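| Material | Reason |
| --- | --- |
| Line-by-line translation of the full pseudocode | The note keeps the core mechanisms of the planner and reroute and avoids lengthy repetition |
| Readings of every curve in the figures | PDF figures need precise manual reading; this note keeps the key values reported in the text |
| Detailed production RefMoE configuration | The paper discloses limited information; the internal model and corpus cannot be re-verified |
| RSN vendor implementation details | The paper gives only an abstract configuration and bandwidth figures and does not disclose the full fabric stack |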

Recommendation

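Decision: merge.

Why: UltraEP fills a key gap in the local archive on the low-level system efficiency of MoE training / serving, and connects directly to the MoE training in DeepSeek-R1, the long-sequence system cost of DAPO/RLVR, and the MoE consistency risk raised by TIM/VeXact.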
