Wang Linhua, Law Jeffrey, Kale Shiv D, Murali T M, Pandey Gaurav
Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.
Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.
F1000Res. 2018 Sep 28;7. doi: 10.12688/f1000research.16415.1. eCollection 2018.
Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor is unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability for a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).
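For readers unfamiliar with stacking, the idea of combining heterogeneous base predictors through a logistic-regression meta-learner can be sketched as follows. This is a minimal illustration under stated assumptions, not the LargeGOPred implementation: the two base predictors (a nearest-centroid scorer and a k-nearest-neighbour scorer), the synthetic two-class data, and all function names are hypothetical choices made only for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_centroid(X, y):
    """Hypothetical base predictor 1: relative distance to the class centroids."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, t in zip(X, y) if t == 1])
    neg = mean([x for x, t in zip(X, y) if t == 0])
    return lambda x: sigmoid(math.dist(x, neg) - math.dist(x, pos))

def fit_knn(X, y, k=3):
    """Hypothetical base predictor 2: fraction of positives among k neighbours."""
    def predict(x):
        nearest = sorted(zip(X, y), key=lambda p: math.dist(x, p[0]))[:k]
        return sum(t for _, t in nearest) / len(nearest)
    return predict

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            for j, zj in enumerate(z):
                grad_w[j] += err * zj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return lambda z: sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)

def fit_stacking(X, y, base_fitters, n_folds=5):
    """Stack base predictors using out-of-fold scores as meta-features."""
    n = len(X)
    meta_features = [None] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        models = [fit([X[i] for i in train], [y[i] for i in train])
                  for fit in base_fitters]
        for i in held_out:
            # Each example is scored only by models that never saw it.
            meta_features[i] = [m(X[i]) for m in models]
    meta = fit_logistic(meta_features, y)
    # Refit the base predictors on all training data for use at prediction time.
    final_models = [fit(X, y) for fit in base_fitters]
    return lambda x: meta([m(x) for m in final_models])

def make_toy_data(n, seed):
    # Synthetic, well-separated two-class data (an assumption for illustration).
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        centre = 2.0 if label else -2.0
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)
    return X, y

X_train, y_train = make_toy_data(100, seed=0)
X_test, y_test = make_toy_data(40, seed=1)
ensemble = fit_stacking(X_train, y_train, [fit_centroid, fit_knn])
probs = [ensemble(x) for x in X_test]
accuracy = sum((p >= 0.5) == bool(t) for p, t in zip(probs, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The meta-learner is trained on out-of-fold scores so that it does not simply reward base predictors that have memorised their own training examples. In a PFP setting one such ensemble would presumably be trained per Gene Ontology term, but that detail is an assumption here rather than something stated in the abstract.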