Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Source

Title: Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents
arXiv: https://arxiv.org/abs/2606.06453
PDF: https://arxiv.org/pdf/2606.06453
Code: https://github.com/Infini-AI-Lab/vortex_torch
Website: https://infini-ai-lab.github.io/vortex_torch/
Authors: Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen
Submitted: 2026-06-04

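Authors and Affiliations

Zhuoming Chen: Carnegie Mellon University.
Xinrui Zhong: Rice University.
Qilong Feng: Carnegie Mellon University.
Ranajoy Sadhukhan: Carnegie Mellon University.
Yang Zhou: Carnegie Mellon University.
Michael Qizhe Shieh: National University of Singapore.
Zhihao Jia: Carnegie Mellon University.
Beidi Chen: Carnegie Mellon University.

Relationship assessment:

The author group is centered on CMU: 6 of the 8 authors are from CMU.
Xinrui Zhong is affiliated with Rice University and Michael Qizhe Shieh with the National University of Singapore, which makes this a cross-institution CMU-Rice-NUS collaboration.
Zhihao Jia and Beidi Chen are listed last among the CMU authors and may have provided lab direction, system design, or project supervision; their specific contributions would need to be confirmed from an author contribution statement or the project repository.
This author group belongs under "LLM serving / sparse attention / AI-agent-assisted systems research". It complements the theory-oriented authors of Transformers are Inherently Succinct and shares no authors with the LLM-security authors of SocioHack.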

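One-Sentence Summary

Vortex is a system for developing and deploying sparse attention in LLM serving: researchers or AI agents express a sparse attention algorithm in a small amount of Python-style code and evaluate its throughput, latency, and accuracy directly inside a real serving system.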

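Paper Context

In long-context and long-generation workloads, the decode phase reads the KV cache at every step, so KV memory traffic increasingly becomes the bottleneck. Sparse attention lowers per-step decode cost by attending to only a subset of KV blocks.

The practical problem is that new sparse attention algorithms are hard to deploy. Modern serving systems use paged KV cache, continuous batching, prefix caching, and high-performance attention backends. To implement a new algorithm, a researcher has to deal with page tables, non-contiguous memory, top-k, gather, cache updates, and GQA/MLA backends. The paper gives two examples: implementing Double Sparse in SGLang takes about 2,000 lines of code, and the original Quest implementation, even with optimized kernels, is 44.4x slower than SGLang full attention in end-to-end AMC23 evaluation.

Vortex aims to turn sparse attention from "algorithm paper plus hand-written systems engineering" into a platform that is programmable, compilable, and verifiable in real serving.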

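System Design

1. vFlow: the sparse attention front-end language

vFlow is a Python-embedded DSL built around the mental model of a single request with logically contiguous tensors. The user only describes two stages:

cache stage: how to precompute query-independent per-block statistics from the KV cache.
indexer stage: how to score blocks with the current query and the cached statistics, select the top-k blocks, and then run attention.

For example, block top-k attention:

The cache stage computes the key centroid of each block.
The indexer stage multiplies the query with the centroids and selects the top-k blocks by score.
Exact softmax attention is then run over the selected blocks.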
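The three steps of block top-k attention can be sketched in plain Python. This is a minimal single-query, single-head illustration and not vFlow code: the function names (`block_centroids`, `select_blocks`, `sparse_attention`), the `1/sqrt(d)` scaling, and the toy data are all assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_centroids(keys, block_size):
    """Cache stage: mean key vector of each block (query-independent)."""
    centroids = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        centroids.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return centroids

def select_blocks(query, centroids, top_k):
    """Indexer stage: score blocks by query . centroid, keep the top_k block ids."""
    scores = [dot(query, c) for c in centroids]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, keys, values, block_size, top_k):
    """Exact softmax attention restricted to tokens of the selected blocks."""
    blocks = select_blocks(query, block_centroids(keys, block_size), top_k)
    token_ids = [t for b in blocks
                 for t in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))  # assumed standard attention scaling
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    peak = max(logits)                   # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return out, blocks

# Toy data (made up): 8 tokens, 4 blocks of 2 tokens each.
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.5, 0.0], [0.5, 0.0]]
values = [[float(t)] for t in range(len(keys))]
out, blocks = sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=2)
print(blocks)  # [1, 3]
```

In a serving system the centroids would be computed once and updated incrementally as new tokens arrive, rather than recomputed per query as in this sketch.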

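2. vTensor: a page-centric tensor abstraction

A tensor in a real serving system may use any of these layouts:

batched layout.
ragged layout.
paged layout.

vTensor binds the underlying PyTorch tensor to its layout metadata, written as (x, C), where the metadata includes the batch size, a pointer array, and a page index structure. The user writes logical tensor operations, and the system interprets them as execution over the actual paged, ragged, or batched layout.

The design has two key properties:

compositional: algorithms are composed from a small set of tensor primitives, so no algorithm needs its own monolithic kernel.
self-contained: intermediate results stay in a format that downstream operators can consume, which reduces layout conversion.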
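To make the idea of pairing data with layout metadata concrete, here is a minimal sketch of a paged tensor with a logical, contiguous view per request. It is not the Vortex implementation: the class names `PagedMeta` and `VTensorSketch`, the field names, and the CSR-style pointer array are assumptions chosen for illustration, and plain lists stand in for PyTorch tensors.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PagedMeta:
    """Hypothetical layout metadata; field names are illustrative only."""
    batch_size: int
    indptr: List[int]        # request i owns page_indices[indptr[i]:indptr[i + 1]]
    page_indices: List[int]  # physical page ids, in logical order per request
    page_size: int

@dataclass
class VTensorSketch:
    data: List[List[float]]  # physical pages: data[page_id] holds page_size slots
    meta: PagedMeta

    def logical_view(self, request: int, length: int) -> List[float]:
        """Return the request's first `length` elements as if stored contiguously."""
        m = self.meta
        pages = m.page_indices[m.indptr[request]:m.indptr[request + 1]]
        flat = [x for p in pages for x in self.data[p]]
        return flat[:length]

# Two requests sharing one physical page pool (made-up numbers).
pool = [[10.0, 11.0], [20.0, 21.0], [30.0, 31.0], [40.0, 41.0]]
meta = PagedMeta(batch_size=2, indptr=[0, 2, 3], page_indices=[2, 0, 3], page_size=2)
vt = VTensorSketch(pool, meta)
print(vt.logical_view(0, 3))  # [30.0, 31.0, 10.0]
print(vt.logical_view(1, 2))  # [40.0, 41.0]
```

The point of the sketch is that user code only sees `logical_view`, while the page lookup stays inside the abstraction.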

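3. Execution backend optimizations

Vortex applies several kinds of engineering optimization:

workload planning: plans the GPU workload according to batch size, sequence length, layout, and operator type.
kernel fusion: greedily fuses fusable sequence-local operators to reduce reads and writes of intermediate tensors.
radix top-k optimization: top-k is the bottleneck of the indexer stage; the system adds stochastic early termination and remapping, trading a controlled loss of recall for speed.
backend compatibility: reuses high-performance attention backends such as FlashInfer and TensorRT-LLM, and adds a cuda_mla kernel that supports more general MLA geometries.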
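Greedy fusion of sequence-local operators can be illustrated with a small grouping pass. This is only a sketch of the general idea: the `(name, is_sequence_local)` representation, the operator names, and the rule "fuse every consecutive run of sequence-local operators" are assumptions, and the actual Vortex fusion criteria may differ.

```python
from typing import List, Tuple

def greedy_fuse(ops: List[Tuple[str, bool]]) -> List[List[str]]:
    """Group consecutive fusable ops into one kernel; other ops stay standalone.

    Each op is (name, is_sequence_local). Both the flag and the names are
    illustrative assumptions, not Vortex's operator set.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    for name, is_local in ops:
        if is_local:
            current.append(name)        # extend the open fused group
        else:
            if current:
                groups.append(current)  # close the open group first
                current = []
            groups.append([name])       # non-fusable op gets its own kernel
    if current:
        groups.append(current)
    return groups

pipeline = [("scale", True), ("add_bias", True), ("topk", False), ("relu", True)]
print(greedy_fuse(pipeline))  # [['scale', 'add_bias'], ['topk'], ['relu']]
```

Each fused group avoids writing its intermediate tensors back to memory, which is where the savings described above come from.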

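Key Experimental Results

AI agents generating algorithms automatically

Setup:

Model: Qwen3-1.7B.
Hardware: NVIDIA H200 SXM.
Tasks: RULER 4K retrieval, AMC23, AIME24.
Agents: Claude Code Opus 4.7, Claude Code Sonnet 4.6, Codex/GPT-5.5.
Each agent generates 20 sparse attention algorithms in a single pass.

Results:

All generated algorithms compile and run in Vortex without human intervention.
Structural diversity scores: 0.789 for Sonnet 4.6, 0.770 for Opus 4.7, and 0.709 for Codex/GPT-5.5.
The generated algorithms form a Pareto frontier on several benchmarks, with throughput roughly 2x to 3.1x that of full attention at accuracy close to full attention.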
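A Pareto frontier over (throughput, accuracy) pairs keeps every configuration that no other configuration beats on both axes at once. The helper below is a generic sketch; the function name and the data points are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (throughput, accuracy); higher is better on both axes.
    """
    frontier = []
    for i, (t, a) in enumerate(points):
        dominated = any(
            (t2 >= t and a2 >= a) and (t2 > t or a2 > a)
            for j, (t2, a2) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((t, a))
    return sorted(frontier)

# Hypothetical (relative throughput, accuracy) pairs, NOT the paper's data.
candidates = [(1.0, 0.40), (2.0, 0.39), (2.5, 0.35), (3.1, 0.38), (2.2, 0.30)]
print(pareto_frontier(candidates))  # [(1.0, 0.4), (2.0, 0.39), (3.1, 0.38)]
```

Here (2.5, 0.35) and (2.2, 0.30) drop out because (3.1, 0.38) is both faster and more accurate.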

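Long-horizon agent iteration

Setup:

Agent: Claude Opus 4.7.
Model: Qwen3-1.7B.
Task: AIME24.
18 hours of automated optimization, 23 rounds, 92 commits.

Results:

The best configuration reaches 11,894 tok/s versus 3,437 tok/s for dense attention, a 3.46x throughput gain.
mean@16 moves from 38.54 to 38.96, so accuracy is preserved.
The search eventually converges to block top-k attention plus system parameter tuning, which suggests that current agents are better at engineering and parameter search around known structures.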

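MLA architecture experiment

On the MLA architecture of GLM-4.7-Flash:

Task: AIME26.
Generation length: 32K.
Hardware: a single NVIDIA B200.
full attention: mean@16 of 0.765 at 1,031 tok/s.
rope-aware sparse attention: mean@16 of 0.752 with roughly 4x higher throughput, up to 4.7x.

Key finding: in MLA, the RoPE component matters for block routing. A rope-unaware scheme that looks only at the compressed content loses noticeable accuracy.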

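Large-model scaling

On the 229B MiniMax-M2.7 model:

Hardware: 4 NVIDIA B200 GPUs.
TP=4.
Task: AIME26.
Generation length: 32K.
full attention: mean@16 of 0.83 at 3,341 tok/s.
block top-k: mean@16 of 0.84 at 4,110 tok/s, a 1.23x gain.
With a tighter block budget, the gain reaches up to 1.37x.

The gain is smaller than on small models because, in the 229B TP setting, parameter and communication overhead take a larger share of the time, and the KV cache accounts for only part of the overall bottleneck.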
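The dilution effect follows Amdahl's law: if attention takes only a fraction of the decode step, even a large attention speedup gives a modest end-to-end gain. The sketch below uses hypothetical attention fractions and borrows 30x as an illustrative attention speedup from the kernel ablation figure in these notes; neither is a measurement of MiniMax-M2.7.

```python
def end_to_end_speedup(attn_fraction: float, attn_speedup: float) -> float:
    """Amdahl's law: only the attention share of the step time gets faster."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

# Hypothetical shares of decode step time spent in attention / KV reads.
for fraction in (0.2, 0.5, 0.8):
    print(fraction, round(end_to_end_speedup(fraction, 30.0), 2))
# 0.2 1.24
# 0.5 1.94
# 0.8 4.41
```

Under these assumed numbers, an end-to-end gain in the 1.2x range corresponds to attention taking roughly a fifth of the step time, which is consistent with the qualitative explanation above but is not a figure reported by the paper.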

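User-facing latency

With 16K inputs at a high request rate:

block top-k lowers P95 TPOT by up to 11.7x.
Quest lowers P95 TPOT by up to 12.8x.

Sparse attention pays off most under high concurrency and long context; at low load attention is not necessarily the main bottleneck, and the benefit shrinks.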
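For readers unfamiliar with the metric, P95 TPOT can be computed from per-request token timestamps as sketched below. The definition used here (mean gap between consecutive output tokens, then a nearest-rank 95th percentile across requests) is a common convention and an assumption; the paper's exact measurement procedure is not described in these notes, and the timestamps are made up.

```python
import math

def tpot(token_times):
    """Time per output token for one request: mean gap between consecutive tokens."""
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)

def percentile(values, p):
    """Nearest-rank percentile, with p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical emission times (seconds) of output tokens for three requests.
requests = [
    [0.0, 0.5, 1.0],
    [0.0, 0.25, 0.5, 0.75],
    [0.0, 1.0, 2.0],
]
per_request = [tpot(r) for r in requests]
print(per_request)                  # [0.5, 0.25, 1.0]
print(percentile(per_request, 95))  # 1.0
```

Because P95 is a tail statistic, it is dominated by the slowest requests, which is why it responds strongly to queueing under high request rates.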

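kernel ablation

With Qwen3-8B, batch size 16, and 32K context, the sparse attention kernel takes 0.025 ms versus 0.760 ms for dense attention, a kernel-level speedup of over 30x.
The indexer and cache kernels take only about 1 to 10 microseconds, so the indexing and caching overhead that Vortex adds is small.
approx_radix_topk + remap is on average 1.49x faster than the radix top-k baseline at $\mathrm{recall@k}>0.97$, with a range of 1.30x to 1.62x.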
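The recall@k constraint can be read as: of the k blocks an exact top-k would select, what fraction does the approximate top-k also return. The helper below is a generic sketch of that check; the function names and scores are made up, and it does not reproduce approx_radix_topk itself.

```python
def exact_topk(scores, k):
    """Reference top-k by full sort (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that the approximate selection also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must be non-empty")
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical block scores and a hypothetical approximate selection.
scores = [0.9, 0.1, 0.8, 0.7, 0.2]
reference = exact_topk(scores, 3)
print(reference)                          # [0, 2, 3]
print(recall_at_k([0, 2, 4], reference))  # 2 of 3 reference ids recovered
```

A recall@k above 0.97 therefore means that, on average, fewer than 3 in 100 of the exact top-k blocks are missed by the approximate selection.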

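Main Takeaways

The core bottleneck of sparse attention covers both the attention kernel and the question of how to construct the dynamic sparsity pattern efficiently.
A programmable system combined with a real-serving validation loop substantially amplifies an AI agent's ability to explore algorithms.
A good DSL can confine an AI agent's output to the space of programs that compile, can be benchmarked, and can be deployed.
For long-generation and high-concurrency workloads, decode sparse attention can deliver significant throughput and latency gains.
In large-model TP settings, the relative benefit of KV cache optimization is diluted by parameter and communication overhead, so system-level bottleneck analysis is needed.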

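Limitations

Currently covers mainly the decode phase; prefill sparse attention is not supported.
Currently targets inference serving; training and the backward pass are not supported.
Agent iteration mostly converges to block top-k and system parameter tuning, and has not yet reliably discovered fundamentally better mechanisms.
The experiments depend on specific hardware, models, benchmarks, and serving stack; moving to other environments requires re-measurement.
Some of the top-k optimizations are approximate, so the accuracy/recall trade-off has to be confirmed per task.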

Reference Intake Brief

Target

Intended target system: a new paper note / documentation on LLM serving and sparse attention systems.
Existing related assets: papers-index.md will serve as the master index.
Proposed form: a new standalone Markdown document.

Reusable Elements

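The vFlow cache/indexer programming model.
The vTensor paged tensor abstraction.
The breakdown of the real bottlenecks in sparse attention serving.
The experimental workflow for AI-agent-assisted systems research.
Throughput and latency evaluation results for decode sparse attention.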

Risks

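Copyright/over-copying: the document paraphrases and excerpts results; no long passages of the original are copied.
Unsourced or unverifiable claims: the main facts come from the arXiv paper and the project repository.
Safety/compliance issues: the paper is systems performance research and contains no directly abusable details.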

Skipped

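Material	Reason
Item-by-item walkthrough of the 60 agent-generated algorithms	The appendix is long; this note keeps the categories and representative conclusions
Full vFlow API table	Can be cited from the official documentation when needed
Point-by-point data for all Pareto plots	The paper presents these as figures; this note keeps the key numbers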

Recommendation

Decision: merge as a new paper note.

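Why: the paper combines value in systems engineering, sparse attention, and AI-agent-driven automated research workflows, which makes it suitable as key material for tracking LLM serving and agentic research going forward.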
