Votano Joseph R, Parham Marc, Hall Lowell H, Kier Lemont B, Hall L Mark
ChemSilico LLC, 48 Baldwin Street, Tewksbury, MA 01876, USA.
Chem Biodivers. 2004 Nov;1(11):1829-41. doi: 10.1002/cbdv.200490137.
Several QSPR models were developed for predicting intrinsic aqueous solubility, S(o). A data set of 5,964 neutral compounds was subdivided into two classes, aromatic and non-aromatic compounds. For each class, three models were built with different methods: two regression models (multiple linear regression and partial least squares) and an artificial neural network (ANN) model. The training sets comprised 3,343 aromatic and 1,674 non-aromatic compounds; 938 compounds were used for external validation testing. Values of -log S(o) range from -1.6 to 10. Topological structure descriptors were used in all models. A genetic algorithm was used for descriptor selection in the regression models; for the ANN model, descriptors were selected by a backward elimination process. All models performed well, with r² values ranging from 0.72 to 0.84 in external validation testing. Across all models and both compound classes, the mean absolute errors in validation ranged from 0.44 to 0.80. These statistical results indicate a sound ANN model. Furthermore, in a comparison with eight other available models, based on predictions for a validation test set of 442 compounds, the ANN model presented in this work (CSLogWS) was clearly superior in both the mean absolute error and the percentage of residuals less than one log unit. In the ANN model, both E-State and hydrogen E-State descriptors were found to be important.
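The validation statistics named in the abstract (r², mean absolute error, and the percentage of residuals below one log unit) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `validation_stats` and the sample values are hypothetical, and r² is assumed here to be the squared Pearson correlation between observed and predicted -log S(o), since the abstract does not state which definition was used.

```python
from math import fsum


def validation_stats(observed, predicted):
    """Summarize agreement between observed and predicted -log S(o) values.

    Returns r2 (assumed: squared Pearson correlation), the mean absolute
    error, and the percentage of residuals smaller than one log unit.
    """
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    # Residual = predicted minus observed, in log units.
    residuals = [p - o for o, p in zip(observed, predicted)]
    mae = fsum(abs(r) for r in residuals) / n
    pct_within_one = 100.0 * sum(1 for r in residuals if abs(r) < 1.0) / n

    # Squared Pearson correlation from sums of centered products.
    mean_o = fsum(observed) / n
    mean_p = fsum(predicted) / n
    cov = fsum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = fsum((o - mean_o) ** 2 for o in observed)
    var_p = fsum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0.0 or var_p == 0.0:
        raise ValueError("r2 is undefined when either sequence is constant")
    r2 = (cov * cov) / (var_o * var_p)

    return {"r2": r2, "mae": mae, "pct_within_one_log_unit": pct_within_one}


if __name__ == "__main__":
    # Made-up illustrative values, not data from the paper.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.0, 2.5, 5.5]
    print(validation_stats(observed, predicted))
```

Reporting the percentage of residuals under one log unit alongside the mean absolute error is useful because a few large misses can hide behind an acceptable average.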