Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems.

Authors

Haidar Azzam, Bayraktar Harun, Tomov Stanimire, Dongarra Jack, Higham Nicholas J

Affiliations

NVIDIA, Santa Clara, CA, USA.

Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA.

Publication information

Proc Math Phys Eng Sci. 2020 Nov;476(2243):20200110. doi: 10.1098/rspa.2020.0110. Epub 2020 Nov 25.

Abstract

Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
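Illustrative sketch (not the authors' implementation): the core idea of the abstract, factorizing in low precision and then recovering FP64 accuracy by refinement, can be shown on a tiny dense system in pure Python. Several simplifications are assumed here. FP16 storage is emulated with the standard `struct` module rather than Tensor Cores, classical iterative refinement is swapped in for the paper's GMRES-based variant, and no scaling step is included. All function names (`fp16`, `lu_low_precision`, `lu_solve`, `residual`, `refine`) and the test matrix are hypothetical.

```python
import struct


def fp16(x):
    """Round a Python float (FP64) to the nearest IEEE half-precision value.

    struct raises OverflowError for values outside the FP16 range, which is
    the overflow hazard that scaling is meant to avoid (not handled here).
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def lu_low_precision(A):
    """LU with partial pivoting; every stored entry is rounded to FP16."""
    n = len(A)
    LU = [[fp16(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Pick the largest pivot in column k and swap rows.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = fp16(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = fp16(LU[i][j] - LU[i][k] * LU[k][j])
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve with the stored factors; the substitutions run in FP64."""
    n = len(b)
    y = [b[p] for p in perm]          # apply the row permutation
    for i in range(n):                # forward substitution (unit lower L)
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):      # back substitution (upper U)
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def residual(A, x, b):
    """r = b - A x, computed in FP64 with the original matrix."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, tol=1e-12, max_iter=50):
    """Classical iterative refinement around a low-precision LU."""
    LU, perm = lu_low_precision(A)
    x = lu_solve(LU, perm, b)         # initial low-accuracy solution
    for it in range(max_iter):
        r = residual(A, x, b)
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            return x, it
        d = lu_solve(LU, perm, r)     # correction from the cheap factors
        x = [xi + di for xi, di in zip(x, d)]
    return x, max_iter


if __name__ == "__main__":
    A = [[4.1, 1.3, 0.7], [1.2, 5.3, 0.9], [0.6, 1.1, 6.7]]
    b = [1.0, 2.0, 3.0]
    x, iters = refine(A, b)
    print("refinement steps:", iters)
    print("max residual:", max(abs(v) for v in residual(A, x, b)))
```

The factorization, which dominates the cost for large systems, is done once in low precision. Each refinement step reuses those factors and only needs a residual in FP64. This sketch relies on the matrix being well conditioned and its entries fitting in the FP16 range. The abstract's GMRES preconditioning and scaling are the techniques it names for going beyond such easy cases.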

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/49af/7735315/f515e6ae8f91/rspa20200110-g1.jpg
