Stanberry Larissa, Rekepalli Bhanu, Liu Yuan, Giblock Paul, Higdon Roger, Montague Elizabeth, Broomall William, Kolker Natali, Kolker Eugene
Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA.
Joint Institute for Computational Sciences, University of Tennessee - Oak Ridge National Laboratory (JICS UT - ORNL), DELSA Global, Oak Ridge, TN, USA.
Concurr Comput. 2014 Sep 10;26(13):2112-2121. doi: 10.1002/cpe.3264.
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high-performance computing architectures and a low-complexity classification algorithm to assign proteins to existing clusters of orthologous groups of proteins. Based on the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
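The abstract does not say how the 80% specificity and sensitivity figures were computed. As a minimal illustration only, assuming the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) applied to membership calls for a single orthologous group, the check could be sketched as follows. The function names and the 0.80 default are hypothetical and are not taken from the authors' workflow.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(
    predicted: Iterable[bool], actual: Iterable[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for paired membership calls.

    predicted[i] is True if the classifier assigned protein i to the group;
    actual[i] is True if protein i truly belongs to the group.
    (Hypothetical helper; standard textbook definitions are assumed.)
    """
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1      # correctly assigned to the group
        elif p and not a:
            fp += 1      # assigned, but not a true member
        elif not p and a:
            fn += 1      # true member that was missed
        else:
            tn += 1      # correctly left out of the group
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def meets_threshold(sensitivity: float, specificity: float,
                    threshold: float = 0.80) -> bool:
    """True if both metrics reach the threshold (0.80 assumed from the abstract)."""
    return sensitivity >= threshold and specificity >= threshold


# Toy data: 5 true members (4 found, 1 missed), 5 non-members (1 wrongly assigned).
predicted = [True, True, True, True, False, True, False, False, False, False]
actual = [True, True, True, True, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(predicted, actual)
```

In this toy case both metrics equal 4/5, so `meets_threshold(sens, spec)` holds; real evaluation would require a curated reference set of group assignments, which the abstract does not describe.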