Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States.
J Chem Theory Comput. 2022 Apr 12;18(4):2132-2143. doi: 10.1021/acs.jctc.1c00504. Epub 2022 Feb 28.
Deep learning methods provide a novel way to establish a correlation between two quantities. In this context, computer vision techniques such as three-dimensional (3D) convolutional neural networks are a natural choice for associating a molecular property with molecular structure, given the inherent 3D nature of a molecule. However, traditional 3D input data structures are intrinsically sparse, which tends to induce instabilities during the learning process and may in turn lead to underfitted results. To address this deficiency, we propose using quantum-chemically derived molecular topological features, namely the localized orbital locator and the electron localization function, as molecular descriptors that provide a relatively denser input representation in 3D space. Such topological features provide a detailed picture of the atomic and electronic configuration and of the interatomic interactions in the molecule, and hence are well suited to predicting properties that depend strongly on the physical or electronic structure of the molecule. Herein, we demonstrate the efficacy of the proposed model by applying it to the task of predicting atomization energies for the QM9-G4MP2 data set, which contains ∼134k molecules. Furthermore, we incorporated the Δ-machine learning approach into our model, which enabled us to reach beyond benchmark accuracy levels (∼1.0 kJ mol⁻¹). As a result, we consistently obtain mean absolute errors on the order of 0.1 kcal mol⁻¹ (∼0.42 kJ mol⁻¹) relative to G4(MP2) theory using relatively modest models, which could potentially be improved further in a systematic manner using additional compute resources.
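The Δ-machine learning idea mentioned in the abstract can be illustrated with a minimal sketch: instead of learning the high-level energy directly, a model learns only the difference between a cheap baseline and the high-level target, and the baseline is added back at prediction time. Everything below is an assumption made for illustration: the data are synthetic, the descriptors are random vectors rather than 3D orbital-based grids, and a closed-form ridge regression stands in for the 3D convolutional network used in the paper. The function names (`fit_ridge`, `predict_ridge`, `delta_learning_demo`) are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-6):
    """Closed-form ridge regression with a bias column; returns the weights."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    """Apply weights from fit_ridge to new descriptors."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def delta_learning_demo(seed=0, n=400, n_train=300):
    """Compare direct learning with Delta-learning on synthetic energies."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 5))                  # synthetic stand-in descriptors
    w_big = rng.normal(size=5)
    w_small = rng.normal(size=5)

    # Assumption: the cheap baseline already captures the large, hard-to-fit part.
    e_low = 100.0 * np.tanh(X @ w_big)
    # The high-level target differs from the baseline by a small, smooth correction.
    e_high = e_low + X @ w_small + 0.01 * rng.normal(size=n)

    tr, te = slice(0, n_train), slice(n_train, n)

    # Direct learning: regress the high-level energy itself.
    w_direct = fit_ridge(X[tr], e_high[tr])
    mae_direct = np.mean(np.abs(predict_ridge(w_direct, X[te]) - e_high[te]))

    # Delta-learning: regress only the correction, then add the baseline back.
    w_delta = fit_ridge(X[tr], e_high[tr] - e_low[tr])
    pred_delta = e_low[te] + predict_ridge(w_delta, X[te])
    mae_delta = np.mean(np.abs(pred_delta - e_high[te]))

    return float(mae_direct), float(mae_delta)

if __name__ == "__main__":
    direct, delta = delta_learning_demo()
    print(f"direct MAE: {direct:.3f}   delta MAE: {delta:.3f}")
```

In this toy setting the correction is far easier to fit than the total energy, which is the general motivation for Δ-learning; how large the gain is on real data depends on how well the chosen baseline method tracks the target.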
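The sparsity argument in the abstract can be made concrete with a rough sketch that compares how much of a 3D grid carries signal under two encodings. This is only an illustration under stated assumptions: the three-atom geometry is hypothetical, and a sum of atom-centred Gaussians is used as a crude stand-in for a smooth scalar field. It is not the localized orbital locator or the electron localization function, both of which require a quantum-chemical calculation. The function names and the grid parameters (`n`, `box`, `sigma`, `threshold`) are likewise illustrative choices.

```python
import numpy as np

def occupancy_grid(coords, n=32, box=8.0):
    """Mark only the voxel containing each atom centre (a sparse encoding)."""
    grid = np.zeros((n, n, n))
    idx = np.floor((coords + box / 2) / box * n).astype(int)
    idx = np.clip(idx, 0, n - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

def smooth_field_grid(coords, n=32, box=8.0, sigma=0.8):
    """Sum of atom-centred Gaussians sampled at voxel centres.

    Assumption: a stand-in for a dense scalar field, not a real ELF/LOL map.
    """
    axis = (np.arange(n) + 0.5) / n * box - box / 2
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.zeros((n, n, n))
    for x, y, z in coords:
        r2 = (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2
        grid += np.exp(-r2 / (2 * sigma ** 2))
    return grid

def filled_fraction(grid, threshold=1e-3):
    """Fraction of voxels whose value exceeds the threshold."""
    return float(np.mean(grid > threshold))

# Hypothetical linear three-atom geometry, coordinates in angstroms.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.1, 0.0, 0.0],
                   [-1.1, 0.0, 0.0]])

sparse_frac = filled_fraction(occupancy_grid(coords))
dense_frac = filled_fraction(smooth_field_grid(coords))

if __name__ == "__main__":
    print(f"occupancy grid filled fraction: {sparse_frac:.5f}")
    print(f"smooth field filled fraction:   {dense_frac:.5f}")
```

With the occupancy encoding, only a handful of voxels out of tens of thousands are nonzero, whereas a smooth field spreads signal over a sizeable share of the grid, which gives convolutional filters more to work with at each position.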