Menczer Andor, van Damme Maarten, Rask Alan, Huntington Lee, Hammond Jeff, Xantheas Sotiris S, Ganahl Martin, Legeza Örs
Strongly Correlated Systems Lendület Research Group, Wigner Research Centre for Physics, H-1525 Budapest, Hungary.
Eötvös Loránd University, Pázmány Péter Sétány 1/C, 1117 Budapest, Hungary.
J Chem Theory Comput. 2024 Oct 8;20(19):8397-8404. doi: 10.1021/acs.jctc.4c00903. Epub 2024 Sep 19.
We report cutting-edge performance results for a single-node hybrid CPU-multi-GPU implementation of the spin-adapted Density Matrix Renormalization Group (DMRG) method on current state-of-the-art NVIDIA DGX-H100 architectures. We evaluate the performance of DMRG electronic structure calculations for the active compounds of FeMoco, the primary cofactor of nitrogenase, and of cytochrome P450 (CYP) enzymes, with complete active space (CAS) sizes of up to 113 electrons in 76 orbitals [CAS(113, 76)] and 63 electrons in 58 orbitals [CAS(63, 58)], respectively. We achieve 246 teraFLOPS of sustained performance, an improvement of more than 2.5× over the performance achieved on DGX-A100 architectures and an 80× acceleration over an OpenMP-parallelized implementation on a 128-core CPU architecture. Our work highlights the ability of tensor network algorithms to utilize high-performance multi-GPU hardware efficiently and shows that combining tensor networks with modern large-scale GPU accelerators can pave the way toward solving some of the most challenging problems in quantum chemistry and beyond.
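To put the quoted speedups in perspective, the short Python sketch below back-calculates the baseline throughputs they imply. It uses only the three figures stated in the abstract. The function name `implied_baseline` is hypothetical, and the calculation assumes that each speedup factor is a ratio of sustained throughput, which the abstract does not state explicitly for the CPU comparison.

```python
# Figures quoted in the abstract.
H100_SUSTAINED_TFLOPS = 246.0  # sustained performance on DGX-H100
SPEEDUP_VS_A100 = 2.5          # "more than 2.5x", so a lower bound on the ratio
SPEEDUP_VS_CPU = 80.0          # versus OpenMP on a 128-core CPU


def implied_baseline(sustained_tflops: float, speedup: float) -> float:
    """Baseline throughput implied by a speedup factor.

    Assumption: the speedup is a ratio of sustained throughput
    (hypothetical helper, not taken from the paper).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return sustained_tflops / speedup


# Because the A100 speedup is "more than 2.5x", this value is an upper bound.
a100_upper_bound = implied_baseline(H100_SUSTAINED_TFLOPS, SPEEDUP_VS_A100)
cpu_estimate = implied_baseline(H100_SUSTAINED_TFLOPS, SPEEDUP_VS_CPU)

print(f"Implied DGX-A100 performance: below {a100_upper_bound:.1f} TFLOPS")
print(f"Implied 128-core CPU performance: about {cpu_estimate:.2f} TFLOPS")
```

Under these assumptions, the DGX-A100 result would lie below roughly 98 teraFLOPS and the CPU baseline near 3 teraFLOPS. These are derived estimates, not values reported in the abstract.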