Large-scale protein function prediction using heterogeneous ensembles.

Authors

Wang Linhua, Law Jeffrey, Kale Shiv D, Murali T M, Pandey Gaurav

Affiliations

Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.

Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.

Publication information

F1000Res. 2018 Sep 28;7. doi: 10.12688/f1000research.16415.1. eCollection 2018.

DOI: 10.12688/f1000research.16415.1

PMID: 30450194

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6221071/

Abstract

Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability of a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).
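As a rough illustration of the stacking idea named in the abstract, the sketch below trains two different kinds of base predictors and then fits a logistic regression meta-learner on their out-of-fold scores. It is not the LargeGOPred implementation: the base learners (a nearest-centroid scorer and a k-nearest-neighbour scorer), the synthetic two-cluster data, the fold scheme, and every function name are assumptions chosen only to keep the example self-contained.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Logistic regression by batch gradient descent; the bias is the last weight."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(a * b for a, b in zip(w, xi)) + w[-1]) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def fit_centroid(X, y):
    """Hypothetical base learner 1: score by relative distance to class centroids."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, t in zip(X, y) if t == 1])
    neg = mean([x for x, t in zip(X, y) if t == 0])
    return lambda x: sigmoid(math.dist(x, neg) - math.dist(x, pos))

def fit_knn(X, y, k=5):
    """Hypothetical base learner 2: fraction of positives among the k nearest points."""
    def score(x):
        nearest = sorted(zip(X, y), key=lambda pair: math.dist(x, pair[0]))[:k]
        return sum(t for _, t in nearest) / k
    return score

def fit_stacking(X, y, base_fitters, n_folds=5):
    """Train the meta-learner on out-of-fold base scores, then refit bases on all data."""
    n = len(X)
    meta_X = [[0.0] * len(base_fitters) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        for b, fitter in enumerate(base_fitters):
            model = fitter(X_tr, y_tr)
            for i in held_out:
                meta_X[i][b] = model(X[i])
    meta_w = fit_logistic(meta_X, y)
    bases = [fitter(X, y) for fitter in base_fitters]

    def predict(x):
        scores = [model(x) for model in bases]
        return sigmoid(sum(w * s for w, s in zip(meta_w, scores)) + meta_w[-1])
    return predict

def make_data(n, rng):
    """Synthetic stand-in for one functional term: two Gaussian clusters in 2-D."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 3.0 * label
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(100, rng)
X_test, y_test = make_data(100, rng)

predict = fit_stacking(X_train, y_train, [fit_centroid, fit_knn])
hits = sum((predict(x) >= 0.5) == (t == 1) for x, t in zip(X_test, y_test))
accuracy = hits / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Fitting the meta-learner on out-of-fold scores, rather than on scores from base models that have already seen the same examples, keeps it from simply rewarding whichever base predictor overfits the most.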
