SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.

Author information

Mathematical and Computer Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.

Publication information

PLoS One. 2012;7(6):e39475. doi: 10.1371/journal.pone.0039475. Epub 2012 Jun 28.

Abstract

With rapid advances in DNA sequencing technologies, a plethora of high-throughput genome and proteome data has been generated from a diverse spectrum of organisms. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. However, traditional database-based domain prediction methods cannot identify novel domains, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the hash seeds shared by the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. We propose a backward graph percolation algorithm that efficiently identifies the domains. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM identified most of the known domains found by InterProScan. Compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains and was three orders of magnitude faster. For example, SECOM predicted a novel sponge-specific domain in nucleoside-triphosphatases (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program available at http://sfb.kaust.edu.sa/Pages/Software.aspx.
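The indexing and graph-construction steps described in the abstract can be illustrated with a minimal sketch. It is not SECOM's implementation: the abstract does not specify the hash seed function, so contiguous k-mers stand in for seeds here, and the edge weight is assumed to be the count of distinct seeds two proteins share. The function names `seed_index` and `seed_graph`, the parameter `k`, and the toy sequences are all hypothetical. The backward graph percolation step is not sketched, since the abstract gives no detail on it.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=4):
    """Map each length-k substring (a stand-in for a hash seed) to the proteins containing it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def seed_graph(index):
    """Weight each protein pair by the number of distinct seeds it shares (assumed weighting)."""
    weights = defaultdict(int)
    for names in index.values():
        # Only proteins that co-occur under a seed are paired,
        # so no all-against-all comparison of sequences is needed.
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy sequences (invented for illustration): P1 and P2 share the segment "TAYIAK".
proteins = {
    "P1": "MKTAYIAKQR",
    "P2": "GGTAYIAKLL",
    "P3": "MKWWWWWWQR",
}
graph = seed_graph(seed_index(proteins, k=4))
print(graph)
```

In this toy input only P1 and P2 end up connected, because they are the only pair with 4-mers in common. A community-finding step would then operate on such a weighted graph.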


