Tetko Igor V, Sushko Iurii, Pandey Anil Kumar, Zhu Hao, Tropsha Alexander, Papa Ester, Oberg Tomas, Todeschini Roberto, Fourches Denis, Varnek Alexandre
Helmholtz Zentrum Munchen-German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Neuherberg D-85764, Germany.
J Chem Inf Model. 2008 Sep;48(9):1733-46. doi: 10.1021/ci800151m. Epub 2008 Aug 26.
Estimating the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that quantifies the similarity between the training set molecules and a test set compound for a given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distance to model based on the standard deviation of predicted toxicity, calculated from an ensemble of models, afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is determined mainly by the distribution of the training set data in the chemistry and activity spaces rather than by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in a wrong estimate of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European Inventory of Existing Commercial Chemical Substances), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at http://www.qspr.org.
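As an illustration of the ensemble-based distance to model described in the abstract, the following minimal sketch computes, for each compound, the standard deviation of the predictions made by an ensemble of models and then compares the mean absolute error of compounds below and above a distance threshold. It is not the authors' implementation: the function names, the toy predictions, the use of the population standard deviation, and the threshold value of 0.5 are all assumptions made for the example.

```python
from statistics import mean, pstdev


def std_distance_to_model(ensemble_predictions):
    """Per-compound standard deviation across an ensemble of predictions.

    ensemble_predictions: one list of predictions per model, all of equal length.
    Assumption: population standard deviation; the paper may define it differently.
    """
    # zip(*...) regroups values so each tuple holds every model's output for one compound
    return [pstdev(per_compound) for per_compound in zip(*ensemble_predictions)]


def mae_by_distance(distances, abs_errors, threshold):
    """Mean absolute error for compounds at or below vs. above a distance threshold."""
    near = [e for d, e in zip(distances, abs_errors) if d <= threshold]
    far = [e for d, e in zip(distances, abs_errors) if d > threshold]
    return (mean(near) if near else None, mean(far) if far else None)


# Hypothetical toy data: 3 models, 4 compounds (not taken from the paper)
preds = [
    [1.0, 2.0, 0.5, 3.0],
    [1.1, 2.0, 1.5, 2.0],
    [0.9, 2.0, 2.5, 4.0],
]
observed = [1.0, 2.1, 0.4, 4.5]

# Consensus prediction = ensemble mean; error is measured against the observed value
consensus = [mean(per_compound) for per_compound in zip(*preds)]
abs_errors = [abs(c - o) for c, o in zip(consensus, observed)]

dm = std_distance_to_model(preds)
# The threshold of 0.5 is an arbitrary illustrative cutoff
near_mae, far_mae = mae_by_distance(dm, abs_errors, threshold=0.5)
print(dm, near_mae, far_mae)
```

In this toy data the compounds on which the models disagree most also carry the largest errors, which is the behavior the abstract reports for this distance; real data would need the statistical evaluation described in the paper rather than a single fixed cutoff.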