Optimizing high performance computing workflow for protein functional annotation.

Authors

Stanberry Larissa, Rekepalli Bhanu, Liu Yuan, Giblock Paul, Higdon Roger, Montague Elizabeth, Broomall William, Kolker Natali, Kolker Eugene

Affiliations

Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA.

Joint Institute for Computational Sciences, University of Tennessee - Oak Ridge National Laboratory (JICS UT - ORNL), DELSA Global, Oak Ridge, TN, USA.

Publication information

Concurr Comput. 2014 Sep 10;26(13):2112-2121. doi: 10.1002/cpe.3264.

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high-performance computing architectures and a low-complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
