Azencott Chloé-Agathe, Ksikes Alexandre, Swamidass S Joshua, Chen Jonathan H, Ralaivola Liva, Baldi Pierre
School of Information and Computer Sciences, University of California-Irvine, Irvine, California 92697-3435, USA.
J Chem Inf Model. 2007 May-Jun;47(3):965-74. doi: 10.1021/ci600397p. Epub 2007 Mar 6.
Many chemoinformatics applications, including high-throughput virtual screening, benefit from rapid prediction of the physical, chemical, and biological properties of small molecules, which makes it possible to screen large repositories and identify suitable candidates. When training sets are available, machine learning methods provide an effective alternative to ab initio methods for these predictions. Here, we leverage rich molecular representations, including 1D SMILES strings, 2D graphs of bonds, and 3D coordinates, to derive efficient machine learning kernels for regression problems. We further expand the library of spectral kernels for small molecules, originally developed for classification problems, to include 2.5D surface and 3D kernels using Delaunay tetrahedrization and other techniques from computational geometry, 3D pharmacophore kernels, and 3.5D or 4D kernels capable of taking into account multiple molecular configurations, such as conformers. The kernels are comprehensively tested on regression problems, using cross-validation and redundancy-reduction methods on several available data sets, to predict boiling points, melting points, aqueous solubility, octanol/water partition coefficients, and biological activity, with state-of-the-art results. When sufficient training data are available, 2D spectral kernels generally yield the best and most robust results, exceeding the state of the art. On data sets containing thousands of molecules, the kernels achieve a squared correlation coefficient of 0.91 for aqueous solubility prediction and 0.94 for octanol/water partition coefficient prediction. Averaging over conformations improves the performance of kernels based on the three-dimensional structure of molecules, especially on challenging data sets. Kernel predictors for aqueous solubility (kSOL), LogP (kLOGP), and melting point (kMELT) are available on the Web at http://cdb.ics.uci.edu.
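As a rough illustration of the ideas named in the abstract, the following sketch builds a simple 1D spectral representation from a SMILES string, compares two molecules with a kernel, and computes a squared correlation coefficient of the kind reported for the solubility and partition-coefficient predictors. It is not the authors' implementation: the function names, the choice of substring lengths, and the use of a MinMax similarity as the kernel are all assumptions made for this example, since the abstract does not specify them.

```python
from collections import Counter


def spectrum(smiles: str, k: int = 3) -> Counter:
    """Count every substring of length 1..k in a SMILES string.

    Hypothetical 1D feature map; the abstract does not give the
    substring lengths or tokenization actually used.
    """
    counts = Counter()
    for n in range(1, k + 1):
        for i in range(len(smiles) - n + 1):
            counts[smiles[i:i + n]] += 1
    return counts


def minmax_kernel(a: Counter, b: Counter) -> float:
    """MinMax similarity between two count vectors (assumed kernel choice)."""
    keys = set(a) | set(b)
    # Counter returns 0 for missing keys, so absent substrings count as zero.
    num = sum(min(a[key], b[key]) for key in keys)
    den = sum(max(a[key], b[key]) for key in keys)
    return num / den if den else 0.0


def squared_correlation(observed, predicted) -> float:
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return (s_op * s_op) / (s_oo * s_pp)


if __name__ == "__main__":
    ethanol = spectrum("CCO", k=2)
    ethylamine = spectrum("CCN", k=2)
    print(minmax_kernel(ethanol, ethylamine))
    print(squared_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In practice a kernel like this would be plugged into a kernel regression method and evaluated by cross-validation, as the abstract describes. The 2D and 3D kernels would replace the substring counts with counts of graph paths or geometric features.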