Department of Chemistry and Biomolecular Science, University of Ottawa, Ottawa K1N 6N5, Canada.
J Chem Theory Comput. 2018 Oct 9;14(10):5229-5237. doi: 10.1021/acs.jctc.8b00788. Epub 2018 Sep 10.
Understanding the performance of machine learning algorithms is essential for designing more accurate and efficient statistical models, yet it is not always possible to unravel the reasoning of neural networks. Here, we propose a method for calculating machine learning kernels in closed, analytic form by combining the atomic property weighted radial distribution function (AP-RDF) descriptor with a Gaussian kernel. This allowed us to analyze and improve on the performance of the Bag-of-Bonds descriptor when the bond type restriction is included in AP-RDF. The improvement is achieved for the prediction of molecular atomization energies (MAE = 1.7 kcal/mol for the QM7 data set) and is due to the incorporation of a tensor product into the kernel, which captures the multidimensional representation of the AP-RDF. In addition, the numerical version of the AP-RDF is a constant-size descriptor, making it more computationally efficient than Bag-of-Bonds. We also discuss a connection between molecular quantum similarity and machine learning kernels based on first-principles types of descriptors.
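To illustrate what a closed-form kernel over a radial distribution function can look like, here is a minimal sketch. It is not the authors' implementation. It assumes the commonly used AP-RDF form RDF(R) = Σ_{i<j} P_i P_j exp(-B (r_ij - R)^2) with the scaling factor set to 1, integrates over R from -∞ to +∞, and uses the standard Gaussian product integral ∫ exp(-B (R - a)^2) exp(-B (R - b)^2) dR = sqrt(π / (2B)) exp(-B (a - b)^2 / 2). The function names, the smoothing parameter `B`, the kernel width `sigma`, the use of nuclear charges as the atomic property, and the coordinates are all illustrative assumptions. The bond type restriction and the tensor product mentioned in the abstract are not modeled.

```python
import numpy as np


def pair_terms(coords, props):
    """Return pair distances r_ij and weights P_i * P_j for all atom pairs i < j."""
    coords = np.asarray(coords, dtype=float)
    props = np.asarray(props, dtype=float)
    i, j = np.triu_indices(len(props), k=1)
    dists = np.linalg.norm(coords[i] - coords[j], axis=1)
    return dists, props[i] * props[j]


def ap_rdf(coords, props, grid, B=10.0):
    """Numerical AP-RDF sampled on a fixed grid of R values (constant-size vector)."""
    dists, weights = pair_terms(coords, props)
    # Rows index grid points, columns index atom pairs.
    gauss = np.exp(-B * (grid[:, None] - dists[None, :]) ** 2)
    return gauss @ weights


def rdf_overlap(mol_a, mol_b, B=10.0):
    """Closed-form integral of RDF_a(R) * RDF_b(R) over all R (assumed limits)."""
    ra, wa = pair_terms(*mol_a)
    rb, wb = pair_terms(*mol_b)
    gauss = np.exp(-0.5 * B * (ra[:, None] - rb[None, :]) ** 2)
    return np.sqrt(np.pi / (2.0 * B)) * (wa @ gauss @ wb)


def gaussian_kernel(mol_a, mol_b, B=10.0, sigma=1.0):
    """Gaussian kernel on the L2 distance between two RDFs, without any grid."""
    d2 = (rdf_overlap(mol_a, mol_a, B)
          + rdf_overlap(mol_b, mol_b, B)
          - 2.0 * rdf_overlap(mol_a, mol_b, B))
    return np.exp(-d2 / (2.0 * sigma ** 2))


# Hypothetical inputs: two water-like geometries (angstrom), with nuclear
# charges standing in for the atomic property P.
mol_a = ([[0.0, 0.0, 0.0], [0.9572, 0.0, 0.0], [-0.24, 0.9266, 0.0]], [8.0, 1.0, 1.0])
mol_b = ([[0.0, 0.0, 0.0], [1.0000, 0.0, 0.0], [-0.24, 0.9266, 0.0]], [8.0, 1.0, 1.0])

k_ab = gaussian_kernel(mol_a, mol_b)
```

Here `ap_rdf` corresponds to the fixed-size numerical descriptor, while `rdf_overlap` and `gaussian_kernel` evaluate the same quantity analytically from pair distances alone, so the kernel can be inspected term by term.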