Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters.

Authors

Palethorpe Elise, Stocks Ryan, Barca Giuseppe M J

Affiliations

School of Computing, Australian National University, Canberra, ACT 2601, Australia.

School of Computing and Information Systems, Melbourne University, Melbourne, VIC 3052, Australia.

Publication information

J Chem Theory Comput. 2024 Dec 10;20(23):10424-10442. doi: 10.1021/acs.jctc.4c00994. Epub 2024 Nov 25.

Abstract

This Article presents two optimized multi-GPU algorithms for Fock matrix construction, building on the work of Ufimtsev and Martinez [ 2009, 5, 1004-1015] and Barca et al. [ 2021, 17, 7486-7503]. The novel algorithms, opt-UM and opt-Brc, introduce significant enhancements, including improved integral screening, exploitation of sparsity and symmetry, a linear scaling exchange matrix assembly algorithm, and extended capabilities for Hartree-Fock calculations up to -type angular momentum functions. Opt-Brc excels for smaller systems and for highly contracted triple-ζ basis sets, while opt-UM is advantageous for large molecular systems. Performance benchmarks on NVIDIA A100 GPUs show that our algorithms in the EXtreme-scale Electronic Structure System (EXESS), when combined, outperform all current GPU and CPU Fock build implementations in TeraChem, QUICK, GPU4PySCF, LibIntX, ORCA, and Q-Chem. The implementations were benchmarked on linear and globular systems, and average speedups of 1.4×, 8.4×, and 9.4× across three double-ζ basis sets were observed relative to TeraChem, QUICK, and GPU4PySCF, respectively. The average speedup over TeraChem increases to 2.1× when using four A100 GPUs. Strong scaling analysis reveals over 91% parallel efficiency on four GPUs for opt-Brc, making it typically faster for multi-GPU execution. Single-compute-node comparisons with CPU-based software such as ORCA and Q-Chem show speedups of up to 42× and 31×, respectively, and an improvement in power efficiency of up to 18×.
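To make the strong-scaling figure concrete, the short sketch below uses the standard definition of parallel efficiency (speedup over one GPU divided by the number of GPUs). The function names and the sample timings are hypothetical and are not taken from the paper; only the 91% efficiency and the four-GPU count come from the abstract.

```python
def parallel_efficiency(t_single: float, t_multi: float, n_gpus: int) -> float:
    """Strong-scaling efficiency: (T_1 / T_N) / N for a fixed problem size."""
    if t_single <= 0 or t_multi <= 0 or n_gpus < 1:
        raise ValueError("timings must be positive and n_gpus >= 1")
    return (t_single / t_multi) / n_gpus


def implied_speedup(efficiency: float, n_gpus: int) -> float:
    """Speedup over a single GPU implied by a given parallel efficiency."""
    return efficiency * n_gpus


# 91% efficiency on four GPUs (figures from the abstract) corresponds to
# a speedup of at least 0.91 * 4 over a single GPU.
print(implied_speedup(0.91, 4))

# Hypothetical timings (seconds), chosen only to illustrate the formula.
print(parallel_efficiency(t_single=100.0, t_multi=27.5, n_gpus=4))
```

Under this definition, an efficiency above 91% on four GPUs means the four-GPU Fock build runs more than about 3.6 times faster than the single-GPU build for the same system.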
