Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

Affiliation

High-Throughput Analysis Core, Seattle Children's Research Institute, Washington, USA.

Publication information

OMICS. 2011 Jul-Aug;15(7-8):513-21. doi: 10.1089/omi.2011.0101.

Abstract

To address the monumental challenge of assigning function to millions of sequenced proteins, we completed a first-of-its-kind all-versus-all sequence alignment using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom JARs implemented on top of Apache Hadoop using the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length-normalized bit score (LNBS) was determined to be the best similarity measure for classifying proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and makes clustering efficient at large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder being clustered into 100,000 functional groups.
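As a rough illustration of the single-linkage step mentioned in the abstract, the sketch below groups proteins from thresholded pairwise scores using a union-find structure on a single machine. It is not the authors' Hadoop/MapReduce implementation, which the abstract does not describe in detail. The function name `single_linkage_clusters`, the input layout (triples of two protein IDs and a score), the example IDs, and the threshold value are all assumptions made for this example; the score stands in for a similarity measure such as LNBS, whose exact formula is not given here.

```python
from collections import defaultdict


def single_linkage_clusters(scored_pairs, threshold):
    """Group items so that any two items joined by a chain of pairs
    scoring at least `threshold` end up in the same cluster.

    scored_pairs: iterable of (id_a, id_b, score) triples (hypothetical layout).
    Returns a list of sorted clusters, ordered by their first member.
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b, score in scored_pairs:
        # Make sure both items exist even if the pair is below threshold,
        # so unlinked items come out as singleton clusters.
        find(a)
        find(b)
        if score >= threshold:
            union(a, b)

    clusters = defaultdict(set)
    for item in list(parent):
        clusters[find(item)].add(item)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Made-up protein IDs and scores, for illustration only.
example_pairs = [
    ("P1", "P2", 0.9),
    ("P2", "P3", 0.8),
    ("P3", "P4", 0.2),
    ("P5", "P6", 0.7),
]
example_clusters = single_linkage_clusters(example_pairs, threshold=0.5)
print(example_clusters)
```

With these made-up inputs, P1 and P3 land in the same group even though they were never compared directly, because each is linked to P2. That chaining behavior is what defines single linkage, and it is why the choice of similarity threshold matters at this scale.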
