Random forest models to predict aqueous solubility.

Author information

Palmer David S, O'Boyle Noel M, Glen Robert C, Mitchell John B O

Affiliation

Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.

Publication information

J Chem Inf Model. 2007 Jan-Feb;47(1):150-8. doi: 10.1021/ci060164k.

PMID:17238260

Abstract

Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop quantitative structure-property relationship (QSPR) models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than the models created by PLS, SVM, and ANN. It also offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL Drug Data Report, on the basis of molecular descriptors selected by the regression analysis.
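
The Random Forest features highlighted in the abstract (descriptor importance and an in-parallel measure of predictive ability) can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy, which the abstract does not name, and uses synthetic stand-in data rather than the 988-molecule solubility set. The helper names `fit_forest` and `rmse`, the number of trees, and the five fake "descriptors" are all hypothetical choices for illustration. The in-parallel measure is interpreted here as the out-of-bag (OOB) estimate, where each molecule is predicted only by trees that did not see it during training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_forest(X, y, n_trees=200, seed=0):
    """Fit a random forest regressor and keep its out-of-bag predictions."""
    model = RandomForestRegressor(
        n_estimators=n_trees,
        oob_score=True,      # score each sample using only trees that never saw it
        random_state=seed,
    )
    model.fit(X, y)
    return model


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as y (log S units in the paper)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in data (an assumption, not the paper's data set):
# 300 "molecules" with 5 "descriptors"; only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

model = fit_forest(X, y)

# Out-of-bag estimate of predictive ability, obtained without a separate validation run.
oob_rmse = rmse(y, model.oob_prediction_)

# Descriptors ranked from most to least important (impurity-based importance).
ranking = np.argsort(model.feature_importances_)[::-1]

print("OOB R^2: ", round(model.oob_score_, 3))
print("OOB RMSE:", round(oob_rmse, 3))
print("Descriptor ranking:", ranking.tolist())
```

On real data the descriptor matrix would come from a cheminformatics toolkit and the OOB figures would be checked against a held-out external test set, as the abstract describes for its 330-molecule set.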
