Using beta binomials to estimate classification uncertainty for ensemble models.

Affiliation

Department of Life Sciences, Simulations Plus, Inc., 45205 10th Street West, Lancaster, CA 93534, USA.

Publication information

J Cheminform. 2014 Jun 22;6:34. doi: 10.1186/1758-2946-6-34. eCollection 2014.

Abstract

BACKGROUND

Quantitative structure-activity relationship (QSAR) models have enormous potential for reducing drug discovery and development costs as well as the need for animal testing. Great strides have been made in estimating their overall reliability, but to fully realize that potential, researchers and regulators need to know how confident they can be in individual predictions.

RESULTS

Submodels in an ensemble that have been trained on different subsets of a shared training pool represent multiple samples of the model space, and the degree of agreement among them contains information on the reliability of ensemble predictions. For artificial neural network ensembles (ANNEs) using two different methods for determining ensemble classification (one using vote tallies, the other averaging individual network outputs), we have found that the distribution of predictions across positive vote tallies can be reasonably well modeled as a beta binomial distribution, as can the distribution of errors. Together, these two distributions can be used to estimate the probability that a given predictive classification will be in error. Large data sets comprising logP, Ames mutagenicity, and CYP2D6 inhibition data are used to illustrate and validate the method. The distributions of predictions and errors for the training pool accurately predicted the distributions of predictions and errors for large external validation sets, even when the numbers of positive and negative examples in the training pool were not balanced. Moreover, the likelihood of a given compound being prospectively misclassified, as a function of the degree of consensus between networks in the ensemble, could in most cases be estimated accurately from the fitted beta binomial distributions for the training pool.
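
To make the idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each distribution is summarized as a histogram over positive vote tallies (index k holds the number of compounds that received k positive votes), that the beta binomial parameters are fitted by the method of moments, and that the error probability at tally k is taken as the ratio of the expected number of errors to the expected number of predictions at that tally. The function names (`betabinom_pmf`, `fit_moments`, `error_rate_by_tally`) and the fitting method are illustrative choices and do not come from the abstract.

```python
import math


def betabinom_pmf(k, n, a, b):
    """Beta binomial probability of k positive votes out of n networks."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_comb + log_num - log_den)


def fit_moments(counts):
    """Method-of-moments estimate of (a, b) from a vote-tally histogram.

    counts[k] is the number of compounds with k positive votes, so the
    ensemble size is len(counts) - 1. This fitting method is an assumption
    of the sketch; a maximum-likelihood fit would also be reasonable.
    """
    n = len(counts) - 1
    total = sum(counts)
    m1 = sum(k * c for k, c in enumerate(counts)) / total
    m2 = sum(k * k * c for k, c in enumerate(counts)) / total
    denom = n * (m2 / m1 - m1 - 1) + m1
    a = (n * m1 - m2) / denom
    b = (n - m1) * (n - m2 / m1) / denom
    if a <= 0 or b <= 0:
        # Happens when the tallies are not overdispersed relative to a binomial.
        raise ValueError("method-of-moments fit gave non-positive parameters")
    return a, b


def error_rate_by_tally(pred_counts, err_counts):
    """Estimated probability of misclassification at each vote tally.

    Assumed combination rule: expected errors at tally k divided by
    expected predictions at tally k, each taken from its fitted
    beta binomial and scaled by the corresponding total count.
    """
    n = len(pred_counts) - 1
    n_pred, n_err = sum(pred_counts), sum(err_counts)
    a_p, b_p = fit_moments(pred_counts)
    a_e, b_e = fit_moments(err_counts)
    rates = []
    for k in range(n + 1):
        expected_pred = n_pred * betabinom_pmf(k, n, a_p, b_p)
        expected_err = n_err * betabinom_pmf(k, n, a_e, b_e)
        # Clamp to 1.0 because two independently fitted curves can cross.
        rates.append(min(1.0, expected_err / expected_pred))
    return rates
```

With training-pool histograms for all predictions and for the misclassified subset, `error_rate_by_tally` returns one estimated error probability per vote tally, which can then be looked up for a new compound from its own tally. The pure standard-library form is used here only to keep the sketch self-contained.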

CONCLUSIONS

Confidence in an individual predictive classification by an ensemble model can be accurately assessed by examining the distributions of predictions and errors as a function of the degree of agreement among the constituent submodels. Further, ensemble uncertainty estimation can often be improved by adjusting the voting or classification threshold based on the parameters of the error distribution. Finally, the profiles for models whose predictive uncertainty estimates are not reliable provide clues to that effect without the need for comparison to an external test set.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e86/4076254/a88abb1e9b90/1758-2946-6-34-1.jpg
