Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

Author information

High-Throughput Analysis Core, Seattle Children's Research Institute, Washington, USA.

Publication information

OMICS. 2011 Jul-Aug;15(7-8):513-21. doi: 10.1089/omi.2011.0101.

Abstract

To address the monumental challenge of assigning function to millions of sequenced proteins, we completed a first-of-its-kind all-versus-all sequence alignment using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom JARs implemented on top of Apache Hadoop, utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length-normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and makes clustering efficient at large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder being clustered into 100,000 functional groups.
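The single-linkage step described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' Hive/Hadoop implementation: it swaps in a union-find structure on a toy hit list to show the idea of linking any two proteins whose similarity passes a cutoff. The abstract does not define LNBS, so the formula used here (bit score divided by the length of the shorter protein), the function names `lnbs` and `single_linkage`, the threshold of 1.0, and all data values are assumptions made purely for illustration.

```python
from collections import defaultdict


def lnbs(bit_score, len_a, len_b):
    """ASSUMED definition: bit score per residue of the shorter protein.
    The source abstract does not give the actual LNBS formula."""
    return bit_score / min(len_a, len_b)


def single_linkage(hits, lengths, threshold):
    """Return connected components of the graph whose edges are BLAST hits
    with (assumed) LNBS >= threshold. This is single-linkage clustering:
    one qualifying hit is enough to merge two groups."""
    parent = {p: p for p in lengths}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, bits in hits:
        if a != b and lnbs(bits, lengths[a], lengths[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    clusters = defaultdict(set)
    for p in lengths:
        clusters[find(p)].add(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy data: protein lengths and pairwise BLAST bit scores.
lengths = {"P1": 100, "P2": 120, "P3": 110, "P4": 300, "P5": 90}
hits = [
    ("P1", "P2", 150.0),  # 150 / 100 = 1.50, linked
    ("P2", "P3", 140.0),  # 140 / 110 ~ 1.27, linked
    ("P3", "P4", 40.0),   # 40 / 110 ~ 0.36, below the cutoff
    ("P4", "P5", 30.0),   # 30 / 90 ~ 0.33, below the cutoff
]
groups = single_linkage(hits, lengths, threshold=1.0)
print(groups)
```

In this toy run P1, P2, and P3 end up in one group even though P1 and P3 have no direct hit, which is the defining behavior of single linkage. At the scale reported in the abstract (billions of records) the hit list would not fit in memory, which is presumably why a distributed MapReduce formulation was used instead.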
