Predicting protein function and other biomedical characteristics with heterogeneous ensembles.

Author information

Whalen Sean, Pandey Om Prakash, Pandey Gaurav

Affiliations

Gladstone Institutes, University of California, San Francisco, CA, USA.

Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.

Publication information

Methods. 2016 Jan 15;93:92-102. doi: 10.1016/j.ymeth.2015.08.016. Epub 2015 Sep 2.

DOI:10.1016/j.ymeth.2015.08.016

PMID:26342255

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4718788/

Abstract

Prediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the output of diverse individual predictors that capture complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its sound scalability using the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
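The abstract names ensemble selection as one of the two heterogeneous ensemble methods but does not spell out the procedure. The following is a minimal sketch of one common variant, greedy forward selection with replacement over averaged base-predictor scores; whether the paper or DataSink uses exactly this variant is an assumption. The function names, predictor names, and toy validation scores are all hypothetical and serve only to illustrate how a weaker but complementary predictor can still improve the ensemble (the "balance of diversity and performance" mentioned above).

```python
# Hypothetical sketch of greedy ensemble selection (with replacement).
# Nothing here is taken from the paper or from DataSink; the data are made up.

def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def average(columns):
    """Element-wise mean of several equally long score lists."""
    return [sum(vals) / len(vals) for vals in zip(*columns)]


def select_ensemble(predictions, labels, max_size=5, metric=accuracy):
    """Repeatedly add the base predictor (repeats allowed) that gives the best
    metric for the averaged ensemble; stop when no addition improves it."""
    chosen, best = [], float("-inf")
    for _ in range(max_size):
        candidates = {
            name: metric(average([predictions[n] for n in chosen + [name]]), labels)
            for name in predictions
        }
        # sorted() makes tie-breaking deterministic (alphabetical order).
        name = max(sorted(candidates), key=candidates.get)
        if candidates[name] <= best:
            break
        chosen.append(name)
        best = candidates[name]
    return chosen, best


# Toy validation-set scores from three different kinds of base predictors.
labels = [1, 1, 1, 0, 0, 0]
predictions = {
    "svm":           [0.9, 0.8, 0.3, 0.9, 0.2, 0.3],  # 4/6 correct alone
    "random_forest": [0.4, 0.9, 0.9, 0.2, 0.1, 0.3],  # 5/6 correct alone
    "naive_bayes":   [0.9, 0.4, 0.7, 0.6, 0.2, 0.1],  # 4/6 correct alone
}

chosen, score = select_ensemble(predictions, labels)
print(chosen, score)  # ['random_forest', 'naive_bayes'] 1.0
```

In this toy case the best single predictor reaches 5/6 accuracy, while averaging it with a weaker predictor that errs on different examples classifies all six correctly. A real application would score candidates on held-out data (for example via nested cross-validation) rather than on the data used to train the base predictors, and would likely use a metric better suited to unbalanced classes than accuracy.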
