Enright Anton J, Kunin Victor, Ouzounis Christos A
Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
Nucleic Acids Res. 2003 Aug 1;31(15):4632-8. doi: 10.1093/nar/gkg495.
Accurate detection of protein families allows assignment of protein function and the analysis of functional diversity in complete genomes. Recently, we presented a novel algorithm called TribeMCL for the detection of protein families that is both accurate and efficient. This method allows family analysis to be carried out on a very large scale. Using TribeMCL, we have generated a resource called TRIBES that contains protein family information, comprising annotations, protein sequence alignments and phylogenetic distributions describing 311 257 proteins from 83 completely sequenced genomes. The analysis of at least 60 934 detected protein families reveals that, with the essential families excluded, paralogy levels are similar between prokaryotes, irrespective of genome size. The number of essential families is estimated to be between 366 and 426. We also show that the currently known space of protein families is scale free and discuss the implications of this distribution. In addition, we show that smaller families are often formed by shorter proteins and discuss the reasons for this intriguing pattern. Finally, we analyse the functional diversity of protein families in entire genome sequences. The TRIBES protein family resource is accessible at http://www.ebi.ac.uk/research/cgg/tribes/.
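To illustrate what "scale free" means for a family-size distribution, the following is a minimal sketch, not the authors' analysis. It assumes a list of family sizes, counts how many families have each size, and estimates the slope of that relationship on log-log axes; a roughly straight line with a negative slope is the usual informal signature of a power law. The function names and the toy data are hypothetical and are not taken from TRIBES, and a least-squares fit on log-log counts is only a crude diagnostic.

```python
import math
from collections import Counter


def size_frequency(family_sizes):
    """Map each family size to the number of families of that size."""
    return Counter(family_sizes)


def loglog_slope(freq):
    """Least-squares slope of log(count) against log(size)."""
    xs = [math.log(size) for size in freq]
    ys = [math.log(count) for count in freq.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (an assumption, not TRIBES data): the number of families
# of size s is 64 / s^2, so the log-log slope should be close to -2.
toy_sizes = [s for s in (1, 2, 4, 8) for _ in range(64 // (s * s))]
slope = loglog_slope(size_frequency(toy_sizes))
```

With real family sizes the points would scatter around the fitted line, and a maximum-likelihood estimator would generally be a more reliable way to estimate the exponent than this regression.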