Optimizing high performance computing workflow for protein functional annotation.

Author information

Stanberry Larissa, Rekepalli Bhanu, Liu Yuan, Giblock Paul, Higdon Roger, Montague Elizabeth, Broomall William, Kolker Natali, Kolker Eugene

Affiliations

Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA.

Joint Institute for Computational Sciences, University of Tennessee - Oak Ridge National Laboratory (JICS UT - ORNL), DELSA Global, Oak Ridge, TN, USA.

Publication information

Concurr Comput. 2014 Sep 10;26(13):2112-2121. doi: 10.1002/cpe.3264.

DOI:10.1002/cpe.3264

PMID:25313296

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4194055/

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high-performance computing architectures and a low-complexity classification algorithm to assign proteins to existing clusters of orthologous groups of proteins (COGs). Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow uses highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
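The abstract does not spell out the classification algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: each protein is assigned to the orthologous group of its strongest alignment hit, and sensitivity and specificity are then measured against known labels. The `Hit` record, the function names, the E-value cutoff of 1e-5, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import namedtuple

# Hypothetical record for one alignment hit (not the paper's data model).
Hit = namedtuple("Hit", ["query", "cog", "evalue"])


def assign_to_cogs(hits, max_evalue=1e-5):
    """Map each query to the COG of its lowest-E-value hit.

    Hits with an E-value above the (assumed) cutoff are ignored, so a
    query with only weak hits stays unassigned.
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return {query: hit.cog for query, hit in best.items()}


def sensitivity_specificity(assigned, truth, cog):
    """One-vs-rest sensitivity and specificity for a single COG."""
    tp = fp = tn = fn = 0
    for query, true_cog in truth.items():
        predicted = assigned.get(query) == cog
        actual = true_cog == cog
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Toy data, invented purely for illustration.
hits = [
    Hit("p1", "COG0001", 1e-30),
    Hit("p1", "COG0002", 1e-8),
    Hit("p2", "COG0002", 1e-12),
    Hit("p3", "COG0001", 1e-2),  # above the cutoff, so p3 stays unassigned
    Hit("p4", "COG0001", 1e-9),
]
truth = {"p1": "COG0001", "p2": "COG0002", "p3": "COG0001", "p4": "COG0002"}

assigned = assign_to_cogs(hits)
sens, spec = sensitivity_specificity(assigned, truth, "COG0001")
print(assigned, sens, spec)
```

In a workflow like the one described, a cutoff of this kind would presumably be tuned until both metrics reach a target such as the 80% stated in the abstract; how the authors actually achieve that guarantee is not given here.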
