Protein families and TRIBES in genome sequence space.

Author information

Enright Anton J, Kunin Victor, Ouzounis Christos A

Affiliations

Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.

Publication information

Nucleic Acids Res. 2003 Aug 1;31(15):4632-8. doi: 10.1093/nar/gkg495.

DOI:10.1093/nar/gkg495

PMID:12888524

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC169885/

Abstract

Accurate detection of protein families allows assignment of protein function and the analysis of functional diversity in complete genomes. Recently, we presented a novel algorithm called TribeMCL for the detection of protein families that is both accurate and efficient. This method allows family analysis to be carried out on a very large scale. Using TribeMCL, we have generated a resource called TRIBES that contains protein family information, comprising annotations, protein sequence alignments and phylogenetic distributions describing 311 257 proteins from 83 completely sequenced genomes. The analysis of at least 60 934 detected protein families reveals that, with the essential families excluded, paralogy levels are similar between prokaryotes, irrespective of genome size. The number of essential families is estimated to be between 366 and 426. We also show that the currently known space of protein families is scale free and discuss the implications of this distribution. In addition, we show that smaller families are often formed by shorter proteins and discuss the reasons for this intriguing pattern. Finally, we analyse the functional diversity of protein families in entire genome sequences. The TRIBES protein family resource is accessible at http://www.ebi.ac.uk/research/cgg/tribes/.
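
The scale-free claim in the abstract means that the number of families of a given size falls off roughly as a power law in that size. As a minimal sketch of how such a claim might be checked, the snippet below bins family sizes and fits a straight line on log-log axes. This is an illustration only and is not the authors' procedure: the function names and the toy data are hypothetical, the toy data are not taken from TRIBES, and a least-squares fit on log-log axes is a crude estimator compared with maximum-likelihood approaches.

```python
import math
from collections import Counter


def size_histogram(family_sizes):
    """Map each family size to the number of families with that size."""
    return Counter(family_sizes)


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(size)."""
    xs = [math.log(size) for size in histogram]
    ys = [math.log(count) for count in histogram.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy data (NOT from TRIBES): the number of families of each
# size is 1024 / size**2, i.e. an exact power law with exponent 2.
toy_sizes = []
for size in (1, 2, 4, 8, 16):
    toy_sizes.extend([size] * (1024 // size**2))

slope = loglog_slope(size_histogram(toy_sizes))
print(f"estimated exponent: {-slope:.2f}")
```

With real family sizes the points would scatter around the line, and the large-size tail, where counts are small, tends to dominate the error of this kind of fit.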

Similar articles

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Automatic annotation of protein function based on family identification.

Proteins. 2003 Nov 15;53(3):683-92. doi: 10.1002/prot.10449.

Identification and distribution of protein families in 120 completed genomes using Gene3D.

Proteins. 2005 May 15;59(3):603-15. doi: 10.1002/prot.20409.

Probing metagenomics by rapid cluster analysis of very large datasets.

PLoS One. 2008;3(10):e3375. doi: 10.1371/journal.pone.0003375. Epub 2008 Oct 10.

Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource.

BMC Bioinformatics. 2012 Oct 13;13:264. doi: 10.1186/1471-2105-13-264.

DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins.

Bioinformatics. 1998;14(2):144-50. doi: 10.1093/bioinformatics/14.2.144.

The SYSTERS Protein Family Database in 2005.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D226-9. doi: 10.1093/nar/gki030.

SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale.

BMC Bioinformatics. 2010 Mar 9;11:120. doi: 10.1186/1471-2105-11-120.

Clustering of proximal sequence space for the identification of protein families.

Bioinformatics. 2002 Jul;18(7):908-21. doi: 10.1093/bioinformatics/18.7.908.

Cited by

zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters.

Nucleic Acids Res. 2025 Jan 24;53(3). doi: 10.1093/nar/gkaf045.

CGG toolkit: Software components for computational genomics.

PLoS Comput Biol. 2023 Nov 7;19(11):e1011498. doi: 10.1371/journal.pcbi.1011498. eCollection 2023 Nov.

zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters.

bioRxiv. 2024 Sep 12:2023.06.07.544063. doi: 10.1101/2023.06.07.544063.

MolluscDB: a genome and transcriptome database for molluscs.

Philos Trans R Soc Lond B Biol Sci. 2021 May 24;376(1825):20200157. doi: 10.1098/rstb.2020.0157. Epub 2021 Apr 5.

Ab Initio Construction and Evolutionary Analysis of Protein-Coding Gene Families with Partially Homologous Relationships: Closely Related Drosophila Genomes as a Case Study.

Genome Biol Evol. 2020 Mar 1;12(3):185-202. doi: 10.1093/gbe/evaa041.

Extensive chromosomal rearrangements and rapid evolution of novel effector superfamilies contribute to host adaptation and speciation in the basal ascomycetous fungi.

Mol Plant Pathol. 2020 Mar;21(3):330-348. doi: 10.1111/mpp.12899. Epub 2020 Jan 8.

The scale-free nature of protein sequence space.

PLoS One. 2018 Aug 1;13(8):e0200815. doi: 10.1371/journal.pone.0200815. eCollection 2018.

No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects.

Microb Biotechnol. 2018 Jul;11(4):588-605. doi: 10.1111/1751-7915.13284. Epub 2018 May 28.

Percolation in protein sequence space.

PLoS One. 2017 Dec 20;12(12):e0189646. doi: 10.1371/journal.pone.0189646. eCollection 2017.

KinFin: Software for Taxon-Aware Analysis of Clustered Protein Sequences.

G3 (Bethesda). 2017 Oct 5;7(10):3349-3357. doi: 10.1534/g3.117.300233.

References

COmplete GENome Tracking (COGENT): a flexible data environment for computational genomics.

Bioinformatics. 2003 Jul 22;19(11):1451-2. doi: 10.1093/bioinformatics/btg161.

GeneTRACE-reconstruction of gene content of ancestral species.

Bioinformatics. 2003 Jul 22;19(11):1412-6. doi: 10.1093/bioinformatics/btg174.

Myriads of protein families, and still counting.

Genome Biol. 2003;4(2):401. doi: 10.1186/gb-2003-4-2-401. Epub 2003 Jan 28.

Domains, motifs and clusters in the protein universe.

Curr Opin Chem Biol. 2003 Feb;7(1):5-11. doi: 10.1016/s1367-5931(02)00003-0.

Studying genomes through the aeons: protein families, pseudogenes and proteome evolution.

J Mol Biol. 2002 May 17;318(5):1155-74. doi: 10.1016/s0022-2836(02)00109-2.

An efficient algorithm for large-scale detection of protein families.

Nucleic Acids Res. 2002 Apr 1;30(7):1575-84. doi: 10.1093/nar/30.7.1575.

SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein.

Nucleic Acids Res. 2002 Jan 1;30(1):299-300. doi: 10.1093/nar/30.1.299.

The Pfam protein families database.

Nucleic Acids Res. 2002 Jan 1;30(1):276-80. doi: 10.1093/nar/30.1.276.

Mining the draft human genome.

Nature. 2001 Feb 15;409(6822):827-8. doi: 10.1038/35057004.

CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins.

Nucleic Acids Res. 2001 Jan 1;29(1):33-6. doi: 10.1093/nar/29.1.33.
