Department of Chemistry, Columbia University, 3000 Broadway, New York, New York 10027, United States.
Schrödinger Inc., 120 West 45th Street, New York, New York 10036, United States.
J Chem Theory Comput. 2018 Aug 14;14(8):4109-4121. doi: 10.1021/acs.jctc.8b00342. Epub 2018 Jul 2.
We present an implementation of phaseless Auxiliary-Field Quantum Monte Carlo (ph-AFQMC) utilizing graphics processing units (GPUs). The AFQMC method is recast in terms of matrix operations which are spread across thousands of processing cores and are executed in batches using custom Compute Unified Device Architecture (CUDA) kernels and the GPU-optimized cuBLAS matrix library. Algorithmic advances include a batched Sherman-Morrison-Woodbury algorithm to quickly update matrix determinants and inverses, density-fitting of the two-electron integrals, an energy algorithm involving a high-dimensional precomputed tensor, and the use of single-precision floating point arithmetic. These strategies accelerate ph-AFQMC calculations with both single- and multideterminant trial wave functions, though particularly dramatic wall-time reductions are achieved for the latter. For typical calculations we find speed-ups of roughly 2 orders of magnitude using just a single GPU card compared to a single modern CPU core. Furthermore, we achieve near-unity parallel efficiency using 8 GPU cards on a single node and can reach moderate system sizes via a local memory-slicing approach. We illustrate the robustness of our implementation on hydrogen chains of increasing length and through the calculation of all-electron ionization potentials of the first-row transition metal atoms. We compare long imaginary-time calculations utilizing a population control algorithm with our previously published correlated sampling approach and show that the latter improves not only the efficiency but also the accuracy of the computed ionization potentials. Taken together, the GPU implementation combined with correlated sampling provides a compelling computational method that will broaden the application of ph-AFQMC to the description of realistic correlated electronic systems.
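The abstract mentions a batched Sherman-Morrison-Woodbury algorithm for updating matrix determinants and inverses. As a rough illustration of that general identity only, and not the authors' CUDA implementation, the sketch below applies a rank-k update A' = A + U Vᵀ to a batch of matrices in NumPy. The function name `smw_update`, the argument layout, and the test sizes are all assumptions made for this example.

```python
import numpy as np

def smw_update(A_inv, det_A, U, V):
    """Batched rank-k update using the Sherman-Morrison-Woodbury identity.

    A_inv : (B, n, n) inverses of the reference matrices A
    det_A : (B,)      determinants of A
    U, V  : (B, n, k) low-rank factors, so that A' = A + U @ V^T
    Returns the inverses and determinants of A' without refactorizing A'.
    """
    k = U.shape[-1]
    Vt = np.swapaxes(V, -1, -2)        # (B, k, n)
    AiU = A_inv @ U                    # (B, n, k)
    C = np.eye(k) + Vt @ AiU           # (B, k, k) small "capacitance" matrices
    # Matrix determinant lemma: det(A') = det(A) * det(I + V^T A^-1 U)
    new_det = det_A * np.linalg.det(C)
    # Woodbury: A'^-1 = A^-1 - A^-1 U C^-1 V^T A^-1
    new_inv = A_inv - AiU @ np.linalg.solve(C, Vt @ A_inv)
    return new_inv, new_det

# Small self-check against direct inversion (illustrative sizes only).
rng = np.random.default_rng(0)
B, n, k = 4, 6, 2
A = rng.standard_normal((B, n, n)) + n * np.eye(n)
U = 0.3 * rng.standard_normal((B, n, k))
V = 0.3 * rng.standard_normal((B, n, k))
new_inv, new_det = smw_update(np.linalg.inv(A), np.linalg.det(A), U, V)
A_new = A + U @ np.swapaxes(V, -1, -2)
print(np.allclose(new_inv, np.linalg.inv(A_new)))
print(np.allclose(new_det, np.linalg.det(A_new)))
```

The work per batch element is dominated by n × n by n × k products plus a k × k solve, which is cheaper than refactorizing an n × n matrix when k is much smaller than n. On a GPU the same products would map onto batched matrix-multiply routines, though how the authors organize this is not described in the abstract.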