Lucić Bono, Nadramija Damir, Basic Ivan, Trinajstić Nenad
The Rugjer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia.
J Chem Inf Comput Sci. 2003 Jul-Aug;43(4):1094-102. doi: 10.1021/ci025636j.
In this study we test whether a simple modeling procedure used in QSAR/QSPR can produce simple models that are, at the same time, as accurate as robust Neural Network Ensemble (NNE) models. We present results from applying two procedures for generating and selecting simple linear and nonlinear multiregression (MR) models: (1) a method for selecting the best possible MR models (named CROMRsel) and (2) the Genetic Function Approximation (GFA) method from the Cerius2 program package. The resulting MR models are rigorously compared with several NNE models. For the comparison we selected four QSAR data sets previously studied with NNE (Tetko et al. J. Chem. Inf. Comput. Sci. 1996, 36, 794-803; Kovalishyn et al. J. Chem. Inf. Comput. Sci. 1998, 38, 651-659): (1) 51 benzodiazepine derivatives, (2) 37 carboquinone derivatives, (3) 74 pyrimidines, and (4) 31 antimycin analogues. These data sets were parameterized with 7, 6, 27, and 53 descriptors, respectively. The modeled properties were anti-pentylenetetrazole activity, antileukemic activity, inhibition constants for dihydrofolate reductase from MB1428 E. coli, and antifilarial activity, respectively. Nonlinearities were introduced into the MR models through 2-fold and/or 3-fold cross-products of the initial (linear) descriptors. Then, using the CROMRsel and GFA programs (J. Chem. Inf. Comput. Sci. 1999, 39, 121-132), sets of the I best descriptors (I ≤ 8 in this paper), ranked by the fit and leave-one-out correlation coefficients, were selected for the multiregression models. Two classes of models were obtained: (1) linear or nonlinear MR models generated starting from the complete set of descriptors, and (2) nonlinear MR models generated starting from the same set of descriptors that was used in the NNE modeling. In addition, the descriptor selection method from CROMRsel was compared with the GFA method included in the QSAR module of the Cerius2 program.
For each data set, the MR models were found to have better cross-validated statistical parameters than the corresponding NNE models, and CROMRsel selected somewhat better MR models than the GFA method. The MR models are also much simpler than the NNEs, which is an important and surprising result, and they additionally express the calculated dependencies in functional form. Moreover, the MR models were shown to be better than all other models obtained by different methods on the same data sets ("old" multivariate regressions, functional-link-net models, back-propagation neural networks, genetic algorithm, and partial least squares models). The study also indicated that robust NNE methods cannot generate good models when applied to small data sets, suggesting that robust methods such as NNE are perhaps better applied to larger data sets.
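The workflow described in the abstract (augment the linear descriptors with cross-products, then pick a small descriptor subset by fit and leave-one-out statistics) can be illustrated with a minimal Python sketch. This is not the CROMRsel or GFA algorithm: it uses a brute-force search over all subsets as a stand-in, assumes that 2-fold cross-products include squared terms, and runs on synthetic data. The function names `add_cross_products`, `fit_and_loo_r`, and `best_subset` are hypothetical and do not come from the paper.

```python
from itertools import combinations, combinations_with_replacement

import numpy as np


def add_cross_products(X, names):
    """Append all 2-fold products x_i * x_j (i <= j, an assumption) to X."""
    cols, labels = [X], list(names)
    for i, j in combinations_with_replacement(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j])[:, None])
        labels.append(f"{names[i]}*{names[j]}")
    return np.hstack(cols), labels


def fit_and_loo_r(X, y):
    """Return (fit r, leave-one-out r) for an ordinary least-squares model."""
    A = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    # Leverages h_ii from the thin QR factorisation of the design matrix.
    Q, _ = np.linalg.qr(A)
    leverage = np.sum(Q ** 2, axis=1)
    # Closed-form leave-one-out predictions: no need to refit n times.
    loo_pred = y - (y - fitted) / (1.0 - leverage)
    r_fit = np.corrcoef(y, fitted)[0, 1]
    r_loo = np.corrcoef(y, loo_pred)[0, 1]
    return r_fit, r_loo


def best_subset(X, y, labels, size):
    """Exhaustively search for the `size` descriptors with the highest LOO r."""
    best = None
    for idx in combinations(range(X.shape[1]), size):
        r_fit, r_loo = fit_and_loo_r(X[:, list(idx)], y)
        if best is None or r_loo > best[2]:
            best = ([labels[k] for k in idx], r_fit, r_loo)
    return best


# Synthetic illustration only: 40 "compounds", 3 made-up descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 2.0 * X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.01, size=40)

X_aug, labels = add_cross_products(X, ["x0", "x1", "x2"])
chosen, r_fit, r_loo = best_subset(X_aug, y, labels, size=2)
print(chosen, round(r_fit, 4), round(r_loo, 4))
```

Exhaustive search is only practical for small descriptor pools. With 53 initial descriptors plus their cross-products, as in the antimycin set, the number of candidate subsets grows very quickly, which is presumably why dedicated selection procedures are used in practice.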