Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, United States.
Defense Threat Reduction Agency, Aberdeen Proving Ground, Maryland 21010, United States.
J Chem Inf Model. 2018 Aug 27;58(8):1561-1575. doi: 10.1021/acs.jcim.8b00114. Epub 2018 Jul 17.
Key requirements for quantitative structure-activity relationship (QSAR) models to gain acceptance by regulatory authorities include a defined domain of applicability (DA) and appropriate measures of goodness-of-fit, robustness, and predictivity. Hence, many DA metrics have been developed over the past two decades. The most intuitive are perhaps distance-to-model metrics, which are most commonly defined in terms of the mean distance between a molecule and its k nearest training samples. Detailed evaluations have shown that the variance of predictions by an ensemble of QSAR models may serve as a DA metric and can outperform distance-to-model metrics. Intriguingly, the performance of the ensemble variance metric has led researchers to conclude that the error of predicting a new molecule does not depend on the input descriptors or machine-learning methods but on its distance to the training molecules. This implies that the distance to training samples may serve as the basis for developing a high-performance DA metric. In this article, we introduce a new Tanimoto distance-based DA metric called the sum of distance-weighted contributions (SDC), which takes into account contributions from all molecules in a training set. Using four acute chemical toxicity data sets of varying sizes and four other molecular property data sets, we demonstrate that SDC correlates well with the prediction error for all data sets regardless of the machine-learning methods and molecular descriptors used to build the QSAR models. Using the acute toxicity data sets, we compared the distribution of prediction errors with respect to SDC, the mean distance to k-nearest training samples, and the variance of random forest predictions. The results showed that the correlation with the prediction error was highest for SDC.
We also demonstrate that SDC allows for the development of robust root mean squared error (RMSE) models and makes it possible to not only give a QSAR prediction but also provide an individual RMSE estimate for each molecule. Because SDC does not depend on a specific machine-learning method, it represents a canonical measure that can be widely used to estimate individual molecule prediction errors for any machine-learning method.
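As a rough illustration of the two distance-based metrics contrasted above, the sketch below computes the mean Tanimoto distance to the k nearest training samples alongside a sum of distance-weighted contributions over the whole training set. The abstract does not state the weighting function, so the exponential form `exp(-a * d / (1 - d))`, the default `a = 3.0`, the representation of fingerprints as Python sets of on-bit indices, and all function names are assumptions made for this example only; it should not be read as the authors' exact definition.

```python
import math


def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union


def mean_knn_distance(query, training, k=3):
    """Mean Tanimoto distance from the query to its k nearest training samples."""
    nearest = sorted(tanimoto_distance(query, fp) for fp in training)[:k]
    return sum(nearest) / len(nearest)


def sdc(query, training, a=3.0):
    """Sum of distance-weighted contributions from ALL training molecules.

    The weight exp(-a * d / (1 - d)) is an assumed form: it equals 1 for an
    identical molecule (d = 0) and decays to 0 as d approaches 1.
    """
    total = 0.0
    for fp in training:
        d = tanimoto_distance(query, fp)
        if d < 1.0:  # a molecule sharing no bits (d == 1) contributes nothing
            total += math.exp(-a * d / (1.0 - d))
    return total


# Hypothetical toy fingerprints, for illustration only.
training_set = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
query_fp = {1, 2, 3}

print("mean 2-NN distance:", mean_knn_distance(query_fp, training_set, k=2))
print("SDC:", sdc(query_fp, training_set))
```

Under these assumptions, a larger SDC means the query has more, and closer, neighbors in the training set. The per-molecule RMSE estimate described above could then be obtained by binning held-out prediction errors by SDC, although that step is not shown here.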