Simulation & Optimization Team, SandboxAQ, Palo Alto, CA 94301.
Sandbox@Alphabet, X, The Moonshot Factory, Mountain View, CA 94043.
Proc Natl Acad Sci U S A. 2022 Aug 16;119(33):e2122762119. doi: 10.1073/pnas.2122762119. Epub 2022 Aug 8.
We have repurposed Google tensor processing units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast intercore interconnects (ICIs), physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size: Operating in float32 precision, a full 2,048-core pod of third-generation TPUs can multiply two matrices with linear size N = 2^20 = 1,048,576 in about 2 min. Via curated algorithms emphasizing large, single-core matrix multiplications, other tasks in dense linear algebra can similarly scale. As examples, we present 1) QR decomposition; 2) resolution of linear systems; and 3) the computation of matrix functions by polynomial iteration, demonstrated by the matrix polar factorization.
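To make the compute-bound regime concrete, here is a minimal distributed matrix multiplication sketch in JAX, the framework through which TPUs are typically programmed. It shards A into column panels and B into row panels, takes local panel products on each core, and sums the partial results over the interconnect. The panel layout, toy size, and names are illustrative assumptions, not the paper's implementation.

```python
from functools import partial

import jax
import jax.numpy as jnp

n_devices = jax.device_count()
N = 1024  # toy linear size; must be divisible by n_devices
assert N % n_devices == 0

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(key_a, (N, N), dtype=jnp.float32)
B = jax.random.normal(key_b, (N, N), dtype=jnp.float32)

# Shard A into column panels and B into row panels, one pair per core.
# C = sum_k A[:, k-block] @ B[k-block, :] is an outer-product
# decomposition of the full product A @ B.
A_panels = A.reshape(N, n_devices, N // n_devices).transpose(1, 0, 2)
B_panels = B.reshape(n_devices, N // n_devices, N)

@partial(jax.pmap, axis_name="cores")
def matmul(a_panel, b_panel):
    # The local panel product runs on each core's matrix units; the
    # psum reduces the partial products over the inter-core links.
    return jax.lax.psum(a_panel @ b_panel, axis_name="cores")

C = matmul(A_panels, B_panels)[0]  # result is replicated; take one copy
assert jnp.allclose(C, A @ B, atol=1e-2)  # loose tolerance for float32
```

Note that the psum leaves a full copy of the product on every core; production distributed algorithms such as SUMMA keep the output sharded so each core's HBM holds only its own block.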
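The polynomial-iteration example can be sketched the same way. The Newton-Schulz iteration below is a classic matmul-only polynomial scheme for the polar factor; it is offered as an assumed stand-in, not necessarily the exact iteration used in the paper, and the step count and sizes are illustrative.

```python
import jax
import jax.numpy as jnp

def polar_newton_schulz(a, steps=50):
    """Polar factor of a square matrix via Newton-Schulz iteration.

    Each step, x <- 0.5 * x @ (3I - x^T x), uses only matrix products,
    so on a TPU it runs almost entirely on the MXUs. The iteration
    converges when the singular values of the starting iterate lie in
    (0, sqrt(3)); dividing by the Frobenius norm guarantees this for
    any full-rank input.
    """
    n = a.shape[-1]
    eye = jnp.eye(n, dtype=a.dtype)
    x = a / jnp.linalg.norm(a)  # scale all singular values into (0, 1]
    for _ in range(steps):
        x = 0.5 * x @ (3.0 * eye - x.T @ x)
    u = x           # orthogonal polar factor
    h = u.T @ a     # symmetric positive-semidefinite factor, a = u @ h
    return u, h

# Toy check: u should be orthogonal and u @ h should reconstruct a.
key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (256, 256), dtype=jnp.float32)
u, h = polar_newton_schulz(a)
print(jnp.linalg.norm(u.T @ u - jnp.eye(256)))  # ~1e-3 in float32
print(jnp.linalg.norm(u @ h - a))               # reconstruction error
```

Because every step is a fixed polynomial in x built from matrix products, the same loop distributes directly over a sharded matmul like the pmap sketch above.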