Li Jie, Liang Jiashu, Wang Zhe, Ptaszek Aleksandra L, Liu Xiao, Ganoe Brad, Head-Gordon Martin, Head-Gordon Teresa
Pitzer Center for Theoretical Chemistry, Department of Chemistry, University of California, Berkeley, California 94720, United States.
Christian Doppler Laboratory for High-Content Structural Biology and Biotechnology, Department of Structural and Computational Biology, Max Perutz Laboratories, University of Vienna, Campus Vienna Biocenter 5, Vienna 1030, Austria.
J Chem Theory Comput. 2024 Mar 12;20(5):2152-2166. doi: 10.1021/acs.jctc.3c01256. Epub 2024 Feb 8.
Theoretical predictions of NMR chemical shifts from first principles can greatly facilitate experimental interpretation and structure identification of molecules in gas, solution, and solid-state phases. However, accurate prediction of chemical shifts using the gold-standard coupled cluster with singles, doubles, and perturbative triple excitations [CCSD(T)] method with a complete basis set (CBS) can be prohibitively expensive. By contrast, machine learning (ML) methods offer inexpensive alternatives for chemical shift prediction but are hampered by poor generalization to molecules outside the original training set. Here, we propose several new ideas for ML prediction of chemical shifts for H, C, N, and O. First, we introduce a novel feature representation, based on the atomic chemical shielding tensors within a molecular environment computed with an inexpensive quantum mechanics (QM) method, and train the model to predict the NMR chemical shieldings of a high-level composite theory that approaches the accuracy of CCSD(T)/CBS. In addition, we train the ML model through a new progressive active learning workflow that reduces the total number of expensive high-level composite calculations required while allowing the model to continuously improve on unseen data. Furthermore, the algorithm provides an error estimate, which signals potential unreliability in a prediction when the estimated error is large. Finally, we introduce a novel approach that preserves the rotational invariance of the features using tensor environment vectors (TEVs), yielding an ML model with higher accuracy than a similar model that uses data augmentation.
We illustrate the predictive capacity of the resulting inexpensive shift machine learning (iShiftML) models across several benchmarks, including unseen molecules in the NS372 data set, gas-phase experimental chemical shifts for small organic molecules, and much larger and more complex natural products in which we can accurately differentiate between subtle diastereomers based on chemical shift assignments.
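The abstract contrasts rotationally invariant features with data augmentation but does not spell out how TEVs are constructed. The sketch below is therefore not the TEV construction; it is only a minimal illustration, under our own assumptions, of what rotational invariance means for features derived from a 3x3 shielding tensor. The names `random_rotation` and `invariant_descriptor`, and the choice of invariants (isotropic value, eigenvalues of the symmetric part, norm of the antisymmetric part), are hypothetical and chosen for illustration.

```python
import numpy as np


def random_rotation(rng):
    """Draw a proper 3x3 rotation matrix (det = +1) from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs of the orthogonal factor
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # turn a reflection into a rotation
    return q


def invariant_descriptor(sigma):
    """Hypothetical rotation-invariant summary of a 3x3 shielding tensor.

    Illustration only; this is NOT the TEV representation from the paper.
    """
    sym = 0.5 * (sigma + sigma.T)
    antisym = 0.5 * (sigma - sigma.T)
    iso = np.trace(sigma) / 3.0              # isotropic shielding
    eigvals = np.linalg.eigvalsh(sym)        # principal values, sorted ascending
    antisym_norm = np.linalg.norm(antisym)   # Frobenius norm of the antisymmetric part
    return np.concatenate(([iso], eigvals, [antisym_norm]))


rng = np.random.default_rng(0)
sigma = rng.normal(size=(3, 3))      # stand-in for a low-level QM shielding tensor
R = random_rotation(rng)
sigma_rot = R @ sigma @ R.T          # the same tensor in a rotated molecular frame

# Raw tensor components depend on the frame; the invariants do not.
raw_changed = not np.allclose(sigma, sigma_rot)
descriptor_unchanged = np.allclose(
    invariant_descriptor(sigma), invariant_descriptor(sigma_rot)
)
print(raw_changed, descriptor_unchanged)
```

A model fed raw tensor components would have to learn this invariance from rotated copies of each molecule (data augmentation), whereas a model fed invariant features gets it by construction. That is the trade-off the abstract refers to.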