Haidar Azzam, Bayraktar Harun, Tomov Stanimire, Dongarra Jack, Higham Nicholas J
NVIDIA, Santa Clara, CA, USA.
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA.
Proc Math Phys Eng Sci. 2020 Nov;476(2243):20200110. doi: 10.1098/rspa.2020.0110. Epub 2020 Nov 25.
Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to handle systems with multiple right-hand sides efficiently. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
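A minimal sketch of the mixed-precision iterative-refinement idea described in the abstract, written in NumPy/SciPy on the CPU for illustration only: FP32 stands in for the FP16/FP32 Tensor Core LU factorization, the refinement loop runs in FP64, and GMRES preconditioned by the low-precision factors solves the correction equation. The function name `mixed_precision_solve`, the simple max-entry scaling, and the use of `scipy.sparse.linalg.gmres` are assumptions for this sketch, not the authors' GPU implementation, which also includes auto-adaptive rounding and multiple right-hand-side handling not reproduced here.

```python
# Sketch: mixed-precision iterative refinement for A x = b.
# FP32 stands in for the FP16/FP32 Tensor Core LU; residual/update are FP64.
import numpy as np
import scipy.linalg as la
from scipy.sparse.linalg import LinearOperator, gmres


def mixed_precision_solve(A, b, tol=1e-12, max_iters=50):
    """Solve A x = b: factorize in low precision, refine in FP64 with GMRES."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)

    # Illustrative scaling so entries fit comfortably in the low-precision range.
    theta = np.max(np.abs(A64))
    A_low = (A64 / theta).astype(np.float32)   # stand-in for the Tensor Core LU input
    lu, piv = la.lu_factor(A_low)

    def apply_prec(v):
        # Approximate A^{-1} v using the low-precision LU factors of A/theta.
        return la.lu_solve((lu, piv), v.astype(np.float32)).astype(np.float64) / theta

    M = LinearOperator(A64.shape, matvec=apply_prec, dtype=np.float64)

    x = apply_prec(b64)                         # initial low-precision solve
    for _ in range(max_iters):
        r = b64 - A64 @ x                       # FP64 residual
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        # GMRES on the correction equation A d = r, preconditioned by the LU factors.
        d, _ = gmres(A64, r, M=M, maxiter=20)
        x += d                                  # FP64 update
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
    x_true = rng.standard_normal(n)
    b = A @ x_true
    x = mixed_precision_solve(A, b)
    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

Under these assumptions the factorization cost is paid in the cheap, fast format, while each refinement step costs only an FP64 residual and a preconditioned GMRES solve, which is what allows the method to retain FP64-level accuracy.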