Associate Laboratory for Computing and Applied Mathematics, National Institute for Space Research, PO BOX 515, 12227-010, São José dos Campos, SP, Brazil.
São Carlos Institute of Chemistry, University of São Paulo, PO Box 780, 13560-970, São Carlos, SP, Brazil.
J Phys Chem A. 2020 Nov 25;124(47):9854-9866. doi: 10.1021/acs.jpca.0c05969. Epub 2020 Nov 11.
Machine learning (ML) models can potentially accelerate the discovery of tailored materials by learning a function that maps chemical compounds to their target properties. A crucial step is encoding the molecular systems for the ML model, and the choice of molecular representation plays a central role. Most representations are based on atomic coordinates (structure); however, this can increase the computational cost of ML training and prediction. Herein, we investigate the impact of choosing coordinate-free descriptors based on the Simplified Molecular Input Line Entry System (SMILES) representation, which can substantially reduce the computational cost of ML predictions. To this end, we evaluate the prediction performance of a feed-forward neural network (FNN) model over five feature selection methods and nine ground-state properties (including energetic, electronic, and thermodynamic properties) from a public data set composed of ∼130k organic molecules. Our best results reached a mean absolute error of ∼0.05 eV, close to chemical accuracy, for the atomization energies (internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, and free energy at 298.15 K). Moreover, for the atomization energies, the out-of-sample error was nine times lower than that of the same FNN model trained with the Coulomb matrix, a traditional coordinate-based descriptor. Furthermore, our results show how the model's accuracy is limited by employing such a low-cost representation, which carries less information about the molecular structure than state-of-the-art methods.
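As a rough illustration of what a coordinate-free descriptor can look like, the sketch below turns a SMILES string into a fixed-length vector of symbol counts and computes a mean absolute error. It is not the descriptor set or feature selection used in the paper: the vocabulary `VOCAB` and the function names `smiles_counts` and `mean_absolute_error` are hypothetical, chosen only for this example, and the character-level counting ignores multi-character tokens such as bracket atoms.

```python
from collections import Counter
from typing import List, Sequence

# Hypothetical, deliberately small vocabulary of single-character SMILES symbols.
# This is an assumption for illustration, not the feature set from the paper.
VOCAB = ["C", "N", "O", "F", "=", "#", "(", ")", "1", "2"]


def smiles_counts(smiles: str) -> List[int]:
    """Count how often each vocabulary symbol occurs in a SMILES string.

    No 3D coordinates are needed: the string alone yields the features.
    """
    counts = Counter(smiles)  # character-level counts
    return [counts[symbol] for symbol in VOCAB]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Acetic acid: two C, two O, one double bond, one branch.
features = smiles_counts("CC(=O)O")
```

A vector like `features` could serve as the input layer of a feed-forward network, while `mean_absolute_error` corresponds to the metric quoted in the abstract. Because such counts discard connectivity and geometry, they also illustrate why a low-cost representation can cap the achievable accuracy.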