EVEREST: automatic identification and classification of protein domains in all protein sequences.

Authors

Portugaly Elon, Harel Amir, Linial Nathan, Linial Michal

Affiliation

School of Computer Science & Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.

Publication

BMC Bioinformatics. 2006 Jun 2;7:277. doi: 10.1186/1471-2105-7-277.

DOI:10.1186/1471-2105-7-277

PMID:16749920

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1533870/

Abstract

BACKGROUND

Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again.
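The iterative structure described above (collect similar segments, cluster them, select families, rescan, repeat) can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the single-linkage clustering, the size-based `select_families` filter (a stand-in for the machine learning selection step), and `rescan_stub` (a stand-in for building statistical models and scanning all sequences) are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Segment = Tuple[str, int, int]   # (protein_id, start, end); hypothetical layout
Hit = Tuple[Segment, Segment]    # two segments reported as similar


def cluster_segments(hits: List[Hit]) -> List[Set[Segment]]:
    """Single-linkage clustering via union-find: linked segments share a cluster."""
    parent: Dict[Segment, Segment] = {}

    def find(s: Segment) -> Segment:
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a, b in hits:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters: Dict[Segment, Set[Segment]] = defaultdict(set)
    for s in list(parent):
        clusters[find(s)].add(s)
    return list(clusters.values())


def select_families(clusters: List[Set[Segment]], min_size: int = 2) -> List[Set[Segment]]:
    """Assumed stand-in for the selection step: keep clusters with enough members."""
    return [c for c in clusters if len(c) >= min_size]


def rescan_stub(families: List[Set[Segment]]) -> List[Hit]:
    """Assumed stand-in for model building and rescanning: re-link family members."""
    hits: List[Hit] = []
    for fam in families:
        members = sorted(fam)
        hits.extend(zip(members, members[1:]))
    return hits


def run_pipeline(initial_hits: List[Hit],
                 rescan: Callable[[List[Set[Segment]]], List[Hit]],
                 rounds: int = 2) -> List[Set[Segment]]:
    """Cluster, select, then rescan to rebuild the segment library for the next round."""
    hits = initial_hits
    families: List[Set[Segment]] = []
    for i in range(rounds):
        families = select_families(cluster_segments(hits))
        if i < rounds - 1:
            hits = rescan(families)
    return families


# Toy input: two groups of similar segments plus one isolated self-hit.
toy_hits: List[Hit] = [
    (("P1", 10, 90), ("P2", 5, 85)),
    (("P2", 5, 85), ("P3", 40, 120)),
    (("P4", 0, 50), ("P5", 0, 50)),
    (("P6", 1, 20), ("P6", 1, 20)),
]
families = run_pipeline(toy_hits, rescan_stub, rounds=2)
print(sorted(len(f) for f in families))
```

In a real setting the rescan step would presumably change the segment library between rounds; the stub here simply preserves family membership so the control flow can be followed end to end.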

RESULTS

Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains.
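Coverage figures of this kind reduce to interval arithmetic over per-protein domain annotations. The sketch below shows one way such numbers could be computed; the half-open interval layout, the function names, and the toy data are assumptions for illustration, and nothing here reproduces the paper's actual evaluation code or results.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open residue range [start, end); assumed layout


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no residue is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals: List[Interval]) -> int:
    """Number of distinct residues covered by a set of domain intervals."""
    return sum(end - start for start, end in merge(intervals))


def residue_coverage(lengths: Dict[str, int],
                     domains: Dict[str, List[Interval]]) -> float:
    """Fraction of all residues in the database that fall inside some domain."""
    total = sum(lengths.values())
    return sum(covered(domains.get(p, [])) for p in lengths) / total


def newly_covered(ours: List[Interval], theirs: List[Interval]) -> int:
    """Residues in one protein covered by `ours` but not by `theirs`."""
    return covered(ours + theirs) - covered(theirs)


# Toy data (hypothetical): two proteins, one of them absent from library B.
lengths = {"A": 100, "B": 50}
library_a = {"A": [(0, 60), (40, 80)], "B": [(10, 30)]}
library_b = {"A": [(0, 50)]}

print(residue_coverage(lengths, library_a))
unannotated_in_b = [p for p in library_a if p not in library_b]
partly_new = [p for p in library_a
              if p in library_b and newly_covered(library_a[p], library_b[p]) > 0]
print(unannotated_in_b, partly_new)
```

The two lists at the end mirror the two comparisons in the text: proteins that the second library does not annotate at all, and proteins where the first library annotates a region the second leaves uncovered.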

CONCLUSION

The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ccf/1533870/97f8a5ae9d11/1471-2105-7-277-1.jpg
