Using beta binomials to estimate classification uncertainty for ensemble models.

Affiliation

Department of Life Sciences, Simulations Plus, Inc., 45205 10th Street West, Lancaster, CA 93534, USA.

Publication information

J Cheminform. 2014 Jun 22;6:34. doi: 10.1186/1758-2946-6-34. eCollection 2014.

DOI:10.1186/1758-2946-6-34

PMID:24987464

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4076254/

Abstract

BACKGROUND

Quantitative structure-activity relationship (QSAR) models have enormous potential for reducing drug discovery and development costs as well as the need for animal testing. Great strides have been made in estimating their overall reliability, but to fully realize that potential, researchers and regulators need to know how confident they can be in individual predictions.

RESULTS

Submodels in an ensemble model that have been trained on different subsets of a shared training pool represent multiple samples of the model space, and the degree of agreement among them contains information on the reliability of ensemble predictions. For artificial neural network ensembles (ANNEs) using two different methods for determining ensemble classification - one using vote tallies and the other averaging individual network outputs - we have found that the distribution of predictions across positive vote tallies can be reasonably well modeled as a beta binomial distribution, as can the distribution of errors. Together, these two distributions can be used to estimate the probability that a given predictive classification will be in error. Large data sets comprising logP, Ames mutagenicity, and CYP2D6 inhibition data are used to illustrate and validate the method. The distributions of predictions and errors for the training pool accurately predicted the distributions of predictions and errors for large external validation sets, even when the numbers of positive and negative examples in the training pool were not balanced. Moreover, the likelihood of a given compound being prospectively misclassified as a function of the degree of consensus between networks in the ensemble could in most cases be estimated accurately from the fitted beta binomial distributions for the training pool.
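To make the idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes an ensemble of 10 networks, so that each compound receives between 0 and 10 positive votes, and it uses made-up tally histograms (`pred_counts`, `err_counts`) purely for illustration. Each histogram is fitted with a beta binomial by maximum likelihood using `scipy.stats.betabinom`, and the two fits are combined by taking the ratio of expected errors to expected predictions at each tally. That ratio is one plausible reading of how the two distributions are used "together"; the function names and the fitting procedure are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom


def fit_beta_binomial(counts):
    """Maximum-likelihood (alpha, beta) for a histogram over vote tallies 0..n."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts) - 1          # number of networks in the ensemble
    k = np.arange(n + 1)         # possible positive-vote tallies

    def neg_log_lik(log_params):
        # Optimise in log space so that alpha and beta stay positive.
        a, b = np.exp(log_params)
        return -np.sum(counts * betabinom.logpmf(k, n, a, b))

    result = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
    return tuple(np.exp(result.x))


def error_probability(pred_counts, err_counts):
    """Estimated P(misclassified | k positive votes) from two fitted distributions."""
    pred_counts = np.asarray(pred_counts, dtype=float)
    err_counts = np.asarray(err_counts, dtype=float)
    n = len(pred_counts) - 1
    k = np.arange(n + 1)

    a_p, b_p = fit_beta_binomial(pred_counts)
    a_e, b_e = fit_beta_binomial(err_counts)

    # Expected number of predictions and of errors at each tally under the fits.
    expected_pred = pred_counts.sum() * betabinom.pmf(k, n, a_p, b_p)
    expected_err = err_counts.sum() * betabinom.pmf(k, n, a_e, b_e)

    # Assumed combination rule: smoothed errors divided by smoothed predictions.
    return np.clip(expected_err / expected_pred, 0.0, 1.0)


# Hypothetical training-pool histograms for a 10-network ensemble (not real data).
pred_counts = [400, 150, 80, 50, 40, 35, 40, 55, 90, 160, 420]
err_counts = [8, 9, 10, 12, 14, 15, 14, 11, 10, 9, 8]

p_err = error_probability(pred_counts, err_counts)
for tally, p in enumerate(p_err):
    print(f"{tally:2d} positive votes: estimated error probability {p:.3f}")
```

With histograms shaped like these, predictions pile up at unanimous tallies while errors concentrate near split votes, so the estimated error probability is highest when the networks disagree most.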

CONCLUSIONS

Confidence in an individual predictive classification by an ensemble model can be accurately assessed by examining the distributions of predictions and errors as a function of the degree of agreement among the constituent submodels. Further, ensemble uncertainty estimation can often be improved by adjusting the voting or classification threshold based on the parameters of the error distribution. Finally, the profiles for models whose predictive uncertainty estimates are not reliable provide clues to that effect without the need for comparison to an external test set.
