Simulation & Optimization Team, SandboxAQ, Palo Alto, CA 94301.
Sandbox@Alphabet, X, The Moonshot Factory, Mountain View, CA 94043.
Proc Natl Acad Sci U S A. 2022 Aug 16;119(33):e2122762119. doi: 10.1073/pnas.2122762119. Epub 2022 Aug 8.
We have repurposed Google tensor processing units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast intercore interconnects (ICIs), physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size: Operating in float32 precision, a full 2,048-core pod of third-generation TPUs can multiply two matrices with linear size N = 2^20 = 1,048,576 in about 2 min. Via curated algorithms emphasizing large, single-core matrix multiplications, other tasks in dense linear algebra can similarly scale. As examples, we present 1) QR decomposition; 2) resolution of linear systems; and 3) the computation of matrix functions by polynomial iteration, demonstrated by the matrix polar factorization.
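To make the compute-bound regime concrete, here is a minimal distributed matrix multiplication sketch in JAX, the framework through which TPUs are typically programmed. It shards A into column panels and B into row panels, takes local panel products on each core, and sums the partial results over the interconnect. The panel layout, toy size, and names are illustrative assumptions, not the paper's implementation.

```python
from functools import partial

import jax
import jax.numpy as jnp

n_devices = jax.device_count()
N = 1024  # toy linear size; must be divisible by n_devices
assert N % n_devices == 0

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(key_a, (N, N), dtype=jnp.float32)
B = jax.random.normal(key_b, (N, N), dtype=jnp.float32)

# Shard A into column panels and B into row panels, one pair per core.
# C = sum_k A[:, k-block] @ B[k-block, :] is an outer-product
# decomposition of the full product A @ B.
A_panels = A.reshape(N, n_devices, N // n_devices).transpose(1, 0, 2)
B_panels = B.reshape(n_devices, N // n_devices, N)

@partial(jax.pmap, axis_name="cores")
def matmul(a_panel, b_panel):
    # The local panel product runs on each core's matrix units; the
    # psum reduces the partial products over the inter-core links.
    return jax.lax.psum(a_panel @ b_panel, axis_name="cores")

C = matmul(A_panels, B_panels)[0]  # result is replicated; take one copy
assert jnp.allclose(C, A @ B, atol=1e-2)  # loose tolerance for float32
```

Note that the psum leaves a full copy of the product on every core; production distributed algorithms such as SUMMA keep the output sharded so each core's HBM holds only its own block.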
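The polynomial-iteration example can be sketched the same way. The Newton-Schulz iteration below is a classic matmul-only polynomial scheme for the polar factor; it is offered as an assumed stand-in, not necessarily the exact iteration used in the paper, and the step count and sizes are illustrative.

```python
import jax
import jax.numpy as jnp

def polar_newton_schulz(a, steps=50):
    """Polar factor of a square matrix via Newton-Schulz iteration.

    Each step, x <- 0.5 * x @ (3I - x^T x), uses only matrix products,
    so on a TPU it runs almost entirely on the MXUs. The iteration
    converges when the singular values of the starting iterate lie in
    (0, sqrt(3)); dividing by the Frobenius norm guarantees this for
    any full-rank input.
    """
    n = a.shape[-1]
    eye = jnp.eye(n, dtype=a.dtype)
    x = a / jnp.linalg.norm(a)  # scale all singular values into (0, 1]
    for _ in range(steps):
        x = 0.5 * x @ (3.0 * eye - x.T @ x)
    u = x           # orthogonal polar factor
    h = u.T @ a     # symmetric positive-semidefinite factor, a = u @ h
    return u, h

# Toy check: u should be orthogonal and u @ h should reconstruct a.
key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (256, 256), dtype=jnp.float32)
u, h = polar_newton_schulz(a)
print(jnp.linalg.norm(u.T @ u - jnp.eye(256)))  # ~1e-3 in float32
print(jnp.linalg.norm(u @ h - a))               # reconstruction error
```

Because every step is a fixed polynomial in x built from matrix products, the same loop distributes directly over a sharded matmul like the pmap sketch above.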