Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545.
Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545.
Proc Natl Acad Sci U S A. 2022 Jul 5;119(27):e2120333119. doi: 10.1073/pnas.2120333119. Epub 2022 Jul 1.
Conventional machine-learning (ML) models in computational chemistry learn to directly predict molecular properties using quantum chemistry only for reference data. While these heuristic ML methods show quantum-level accuracy with speeds several orders of magnitude faster than traditional quantum chemistry methods, they suffer from poor extensibility and transferability; i.e., their accuracy degrades on large or new chemical systems. Incorporating quantum chemistry frameworks directly into the ML models solves this problem. Here we take the structure of semiempirical quantum mechanics (SEQM) methods to construct dynamically responsive Hamiltonians. SEQM methods use empirical parameters fitted to experimental properties to construct reduced-order Hamiltonians, facilitating much faster calculations than ab initio methods but with compromised accuracy. By replacing these static parameters with machine-learned dynamic values inferred from the local environment, we greatly improve the accuracy of the SEQM methods. Trained on molecular energies and atomic forces, these dynamically generated Hamiltonian parameters show a strong correlation with atomic hybridization and bonding. Trained with only about 60,000 small organic molecular conformers, the resulting model retains interpretability, extensibility, and transferability when tested on much larger chemical systems and when predicting various molecular properties. Overall, this work demonstrates the virtues of incorporating physics-based descriptions with ML to develop models that are simultaneously accurate, transferable, and interpretable.
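The central idea of the abstract, replacing static Hamiltonian parameters with values inferred from each atom's local environment, can be illustrated with a deliberately small sketch. This is not the SEQM formalism or the authors' model: it assumes a Hückel-like one-orbital-per-atom Hamiltonian, a smooth coordination number as the only environment descriptor, and a fixed linear function standing in for the trained ML model. All function names and parameter values (`coordination`, `dynamic_alpha`, `alpha0`, `w`, `beta0`, `decay`, `r_cut`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def pair_distances(positions):
    """Matrix of interatomic distances for an (n_atoms, 3) array."""
    diff = positions[:, None, :] - positions[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def coordination(positions, r_cut=2.0):
    """Toy local-environment descriptor: smooth neighbor count per atom."""
    d = pair_distances(positions)
    # Cosine cutoff: weight falls smoothly from 1 at d = 0 to 0 at d >= r_cut.
    f = np.where(d < r_cut, 0.5 * (np.cos(np.pi * d / r_cut) + 1.0), 0.0)
    np.fill_diagonal(f, 0.0)  # an atom is not its own neighbor
    return f.sum(axis=1)


def dynamic_alpha(positions, alpha0=-0.5, w=-0.1):
    """Stand-in for the ML model: on-site energy as a linear function of the
    local environment. A real model would be a trained neural network."""
    return alpha0 + w * coordination(positions)


def build_hamiltonian(positions, alpha, beta0=-1.0, decay=1.0):
    """Hueckel-like Hamiltonian: on-site terms alpha (scalar or per-atom array)
    and distance-dependent hopping terms (an assumed exponential form)."""
    d = pair_distances(positions)
    h = beta0 * np.exp(-decay * d)
    h[np.diag_indices_from(h)] = alpha
    return h


def electronic_energy(h, n_electrons):
    """Sum of the lowest eigenvalues, each doubly occupied."""
    eps = np.linalg.eigvalsh(h)
    return 2.0 * eps[: n_electrons // 2].sum()


# Three atoms on a line, one unit apart (arbitrary units).
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])

# Static parameters: every atom gets the same on-site energy.
e_static = electronic_energy(build_hamiltonian(positions, alpha=-0.5), 2)

# Dynamic parameters: the middle atom has a different environment from the
# end atoms, so it receives a different on-site energy.
alpha_dyn = dynamic_alpha(positions)
e_dynamic = electronic_energy(build_hamiltonian(positions, alpha_dyn), 2)

print("per-atom on-site energies:", alpha_dyn)
print("static energy:", e_static, "dynamic energy:", e_dynamic)
```

Because the parameters enter through the Hamiltonian rather than through a direct energy prediction, the energy still comes from diagonalization, which is the structural feature the abstract credits for transferability. In this sketch the per-atom values in `alpha_dyn` are also directly inspectable, loosely mirroring the interpretability claim.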