Snowdon Calum, Barca Giuseppe M J
School of Computing, Australian National University, Canberra 2600, Australia.
School of Computing and Information Systems, University of Melbourne, Melbourne 3010, Australia.
J Chem Theory Comput. 2024 Nov 12;20(21):9394-9406. doi: 10.1021/acs.jctc.4c00814. Epub 2024 Oct 18.
Second-order Møller-Plesset perturbation theory (MP2) using the Resolution of the Identity approximation (RI-MP2) is a widely used method for computing molecular energies beyond the Hartree-Fock mean-field approximation. However, its high computational cost and the lack of efficient algorithms for modern supercomputing architectures limit its applicability to large molecules. In this paper, we present the first distributed-memory many-GPU RI-MP2 algorithm explicitly designed to utilize hundreds of GPU accelerators for every step of the computation. Our novel algorithm achieves near-peak performance on GPU-based supercomputers through two developments: a distributed-memory algorithm for forming RI-MP2 intermediate tensors with zero internode communication, except for a single asynchronous broadcast, and a distributed-memory algorithm for the energy reduction step, capable of sustaining near-peak performance on clusters with several hundred GPUs. Comparative analysis shows that our implementation outperforms state-of-the-art quantum chemistry software by over 3.5 times in speed while achieving an 8-fold reduction in computational power consumption. In benchmarks on the Perlmutter supercomputer, our algorithm achieves 11.8 PFLOP/s (83% of peak performance) when performing the RI-MP2 energy calculation on a 314-water cluster with 7850 primary and 30,144 auxiliary basis functions in 4 min on 180 nodes with 720 A100 GPUs. This performance represents a substantial improvement over traditional CPU-based methods, demonstrating the significant time-to-solution and power consumption benefits of leveraging modern GPU-accelerated computing environments for quantum chemistry calculations.
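For orientation, the energy reduction step mentioned in the abstract can be illustrated with a minimal serial sketch of the textbook closed-shell RI-MP2 correlation energy. This is not the authors' distributed many-GPU algorithm. The function name `ri_mp2_energy` and the nested-list layout `B[i][a][P]` (occupied orbital i, virtual orbital a, auxiliary function P) are assumptions made for illustration; the abstract does not describe the data layout or the implementation.

```python
def ri_mp2_energy(B, eps_occ, eps_virt):
    """Closed-shell RI-MP2 correlation energy (textbook formula, serial sketch).

    B[i][a] is a list over auxiliary functions P (hypothetical layout),
    eps_occ and eps_virt are occupied and virtual orbital energies.
    """
    n_occ, n_virt = len(eps_occ), len(eps_virt)
    energy = 0.0
    for i in range(n_occ):
        for j in range(n_occ):
            # (ia|jb) ~= sum_P B[i][a][P] * B[j][b][P]:
            # one small matrix product per (i, j) pair.
            V = [[sum(p * q for p, q in zip(B[i][a], B[j][b]))
                  for b in range(n_virt)]
                 for a in range(n_virt)]
            for a in range(n_virt):
                for b in range(n_virt):
                    denom = eps_occ[i] + eps_occ[j] - eps_virt[a] - eps_virt[b]
                    # V[b][a] is the exchange-type integral (ib|ja).
                    energy += V[a][b] * (2.0 * V[a][b] - V[b][a]) / denom
    return energy
```

Each (i, j) pair is an independent matrix product followed by a local sum, which is the kind of structure that can in principle be spread across many accelerators; how the paper actually partitions and schedules this work is not stated in the abstract.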