Tetko Igor V, Solov'ev Vitaly P, Antonov Alexey V, Yao Xiaojun, Doucet Jean Pierre, Fan Botao, Hoonakker Frank, Fourches Denis, Jost Piere, Lachiche Nicolas, Varnek Alexandre
Institute of Bioorganic & Petrochemistry, Kiev, Ukraine.
J Chem Inf Model. 2006 Mar-Apr;46(2):808-19. doi: 10.1021/ci0504216.
A benchmark of several popular methods, Associative Neural Networks (ANN), Support Vector Machines (SVM), k Nearest Neighbors (kNN), Maximal Margin Linear Programming (MMLP), Radial Basis Function Neural Network (RBFNN), and Multiple Linear Regression (MLR), is reported for quantitative structure-property relationships (QSPR) of the stability constants logK1 for 1:1 (M:L) complexes and logβ2 for 1:2 complexes of the metal cations Ag+ and Eu3+ with diverse sets of organic molecules in water at 298 K and an ionic strength of 0.1 M. The methods were tested on three types of descriptors: molecular descriptors including E-state values, counts of atoms determined for E-state atom types, and substructural molecular fragments (SMF). The models were compared using a 5-fold external cross-validation procedure. Robust statistical tests (bootstrap and Kolmogorov-Smirnov statistics) were employed to evaluate the significance of the calculated models. The Wilcoxon signed-rank test was used to compare the performance of the methods. Individual structure-complexation property models obtained with nonlinear methods performed significantly better than models built using multiple linear regression analysis (MLRA). However, averaging several MLRA models based on SMF descriptors gave predictions as good as those of the most efficient nonlinear techniques. Support Vector Machines and Associative Neural Networks contributed to the largest number of significant models. Models based on fragments (SMF descriptors and E-state counts) had higher prediction ability than those based on E-state indices: SMF descriptors and E-state counts provided similar results, whereas E-state indices led to less significant models. The study illustrates the difficulty of comparing different methods quantitatively: conclusions based on only one data set, without appropriate statistical tests, could be wrong.
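The evaluation protocol named in the abstract (5-fold cross-validation, followed by a Wilcoxon signed-rank comparison of methods) can be illustrated with a minimal sketch. This is not the authors' code: the function names `k_fold_indices`, `cross_validated_rmse` and `wilcoxon_signed_rank`, the use of RMSE as the error measure, and the paired error values at the end are all assumptions made for illustration, and the exact p-value enumeration is only practical for a small number of pairs.

```python
import itertools
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices once and split them into k disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def cross_validated_rmse(xs, ys, fit, k=5, seed=0):
    """Pool the held-out predictions of all k folds and return their RMSE.

    `fit(train_xs, train_ys)` is a hypothetical callable returning a predictor.
    """
    preds = [None] * len(ys)
    for train, test in k_fold_indices(len(ys), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = model(xs[i])
    return rmse(ys, preds)


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped; tied |differences| receive average ranks.
    Returns (W, p) with W = min(W+, W-).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis every assignment of signs is equally likely.
    total = sum(ranks)
    count = 0
    for signs in itertools.product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            count += 1
    return w, count / 2 ** n


# Made-up paired RMSE values for two hypothetical methods on six data sets.
rmse_method_a = [1.0, 1.2, 0.9, 1.1, 1.3, 1.05]
rmse_method_b = [1.4, 1.5, 1.0, 1.6, 1.9, 1.25]
w_stat, p_value = wilcoxon_signed_rank(rmse_method_a, rmse_method_b)
print(f"W = {w_stat}, two-sided p = {p_value:.4f}")
```

Pairing the errors per data set (or per fold) is what makes the signed-rank test applicable: each method is scored on exactly the same held-out molecules, so only the differences between the paired scores enter the statistic.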