Predicting protein function and other biomedical characteristics with heterogeneous ensembles.

Authors

Whalen Sean, Pandey Om Prakash, Pandey Gaurav

Affiliations

Gladstone Institutes, University of California, San Francisco, CA, USA.

Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.

Publication information

Methods. 2016 Jan 15;93:92-102. doi: 10.1016/j.ymeth.2015.08.016. Epub 2015 Sep 2.

Abstract

Prediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the output of diverse individual predictors that capture complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its sound scalability using the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
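To make the idea of ensemble selection concrete, the following is a minimal sketch, not the authors' implementation and not DataSink's API. It assumes a simplified greedy forward selection (with replacement) that averages base-predictor probabilities on a validation set and stops when accuracy no longer improves. The predictor names (`svm`, `rf`, `knn`), the toy probabilities, the 0.5 threshold, and the use of accuracy as the metric are all illustrative assumptions.

```python
from typing import Dict, List, Sequence


def accuracy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of examples where thresholding at 0.5 matches the label."""
    hits = sum(int(p >= 0.5) == y for p, y in zip(probs, labels))
    return hits / len(labels)


def average(columns: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of several prediction vectors."""
    return [sum(vals) / len(vals) for vals in zip(*columns)]


def greedy_ensemble_selection(
    base_preds: Dict[str, Sequence[float]],
    labels: Sequence[int],
    max_rounds: int = 10,
) -> List[str]:
    """Greedily add the base predictor that most improves the averaged
    ensemble's validation accuracy; stop when nothing improves it.
    A predictor may be picked more than once (selection with replacement)."""
    selected: List[str] = []
    best_score = float("-inf")
    for _ in range(max_rounds):
        round_best_name, round_best_score = None, best_score
        for name in sorted(base_preds):  # sorted for deterministic tie-breaking
            trial = [base_preds[n] for n in selected + [name]]
            score = accuracy(average(trial), labels)
            if score > round_best_score:
                round_best_name, round_best_score = name, score
        if round_best_name is None:  # no candidate improved the ensemble
            break
        selected.append(round_best_name)
        best_score = round_best_score
    return selected


# Hypothetical validation labels and base-predictor probabilities (toy data).
labels = [1, 0, 1, 0, 1, 0]
base_preds = {
    "svm": [0.9, 0.2, 0.4, 0.1, 0.8, 0.6],  # wrong on examples 3 and 6
    "rf":  [0.6, 0.4, 0.7, 0.6, 0.3, 0.2],  # wrong on examples 4 and 5
    "knn": [0.7, 0.7, 0.6, 0.3, 0.4, 0.7],  # wrong on examples 2, 5 and 6
}

selected = greedy_ensemble_selection(base_preds, labels)
ensemble_acc = accuracy(average([base_preds[n] for n in selected]), labels)
print(selected, ensemble_acc)
```

In this toy setting the two selected predictors err on different examples, so their average corrects both sets of mistakes; this is the kind of complementarity between diverse base predictors that the abstract describes.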
