EVEREST: automatic identification and classification of protein domains in all protein sequences.

Authors

Portugaly Elon, Harel Amir, Linial Nathan, Linial Michal

Affiliation

School of Computer Science & Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.

Publication

BMC Bioinformatics. 2006 Jun 2;7:277. doi: 10.1186/1471-2105-7-277.

Abstract

BACKGROUND

Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, computational methodologies for large-scale determination of protein domains and their boundaries have not kept pace. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then clusters these segments into putative domain families. The best putative families are selected using machine learning techniques, and a statistical model is created for each of the chosen families. This procedure is then iterated: the statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again.
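The first steps of the process described above (building a segment library from pairwise hits, clustering the segments, and keeping the best candidates) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the hit tuple format, the function names, the use of single-linkage grouping via union-find in place of EVEREST's actual clustering, and a plain size threshold in place of the machine-learning selection step. The statistical modeling and iteration steps are not shown.

```python
from collections import defaultdict

# Hypothetical hit format (not from the paper):
# (protein_a, start_a, end_a, protein_b, start_b, end_b)

def build_segment_library(hits):
    """Collect every aligned segment that appears in a pairwise hit."""
    segments = set()
    for pa, sa, ea, pb, sb, eb in hits:
        segments.add((pa, sa, ea))
        segments.add((pb, sb, eb))
    return segments

def cluster_segments(hits):
    """Group segments linked by a hit (single linkage via union-find).

    This is a stand-in for the clustering step, not the authors' method.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pa, sa, ea, pb, sb, eb in hits:
        ra, rb = find((pa, sa, ea)), find((pb, sb, eb))
        if ra != rb:
            parent[ra] = rb

    families = defaultdict(set)
    for seg in list(parent):
        families[find(seg)].add(seg)
    return list(families.values())

def select_families(families, min_size=3):
    """Placeholder for the selection step: keep families above a size cutoff."""
    return [family for family in families if len(family) >= min_size]

# Toy input: three segments linked in a chain, plus one isolated pair.
hits = [
    ("P1", 1, 50, "P2", 10, 60),
    ("P2", 10, 60, "P3", 5, 55),
    ("P4", 1, 30, "P5", 2, 31),
]
putative = cluster_segments(hits)
chosen = select_families(putative, min_size=3)
```

In this toy input the chain P1, P2, P3 forms one putative family that passes the cutoff, while the P4/P5 pair is discarded.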

RESULTS

Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains.
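A residue-level coverage figure such as the 85% quoted above is typically the number of amino acids lying inside at least one annotated domain, divided by the total number of amino acids. The sketch below shows one way to compute such a fraction when domains may overlap or nest. The data layout and function names are assumptions for illustration, and the abstract does not state exactly how the authors computed their figure.

```python
def covered_residues(domains):
    """Count residues covered by at least one domain.

    `domains` is a list of (start, end) pairs, 1-based and inclusive.
    Overlapping or nested domains are counted only once.
    """
    covered = 0
    current_end = 0  # last residue already counted
    for start, end in sorted(domains):
        start = max(start, current_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered

def coverage_fraction(proteins):
    """`proteins` maps a protein id to (length, list of domain intervals)."""
    total = sum(length for length, _ in proteins.values())
    hit = sum(covered_residues(doms) for _, doms in proteins.values())
    return hit / total if total else 0.0

# Toy data (hypothetical, not taken from Swiss-Prot).
toy = {
    "P1": (100, [(1, 50), (40, 60)]),  # overlapping domains cover 60 residues
    "P2": (100, []),                   # no annotation
}
fraction = coverage_fraction(toy)
```

Sorting the intervals first means each residue is counted once even when domains overlap, which matters here because EVEREST domains are described as often combining or subdividing other domains.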

CONCLUSION

The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
