Kunin Victor, Teichmann Sarah A, Huynen Martijn A, Ouzounis Christos A
Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
Bioinformatics. 2005 Jun 1;21(11):2618-22. doi: 10.1093/bioinformatics/bti386. Epub 2005 Mar 15.
Databases of protein families often exhibit drastically different properties of the protein family space.
We compared the properties of protein family space as reflected by exhaustive protein family databases and by databases with predefined families. We used TRIBES, Protomap, ProDom and COGs as representatives of the exhaustive databases, and Pfam-A and Superfamily as databases that predefine families. We observe a power-law distribution of family sizes in all these databases, although in the predefined databases the power law breaks down before reaching the smaller families. We discuss future trends of this power-law distribution and suggest that saturation in the sampling of protein family space will distort the power law at small family sizes. For larger genomes, predefined databases show logarithmic growth in the number of families per genome, whereas exhaustive databases exhibit a virtually linear relationship. The databases also differ consistently in the proportion of protein families shared between taxa: predefined databases have a larger number of protein families shared between the three domains of life, while exhaustive databases show a much more fragmented distribution. We argue that these discrepancies reflect alternative approaches to the trade-off between sensitivity and specificity in the detection of homologous proteins. We conclude that these properties are complementary rather than contradictory, and that they describe the protein universe from different perspectives.
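As an illustration of what a power-law distribution of family sizes means in practice, the following minimal sketch estimates the exponent from a list of family sizes by fitting a straight line to the log-log histogram. It is not the authors' procedure: the function names (`family_size_histogram`, `fit_power_law`) and the toy data are assumptions introduced here. Log-log least squares is only a rough diagnostic, and maximum-likelihood estimators are generally preferred for real data.

```python
import math
from collections import Counter


def family_size_histogram(family_sizes):
    """Map each family size to the number of families of that size."""
    return Counter(family_sizes)


def fit_power_law(histogram):
    """Fit count = c * size**(-alpha) by least squares in log-log space.

    Returns (alpha, c). Hypothetical helper, for illustration only.
    """
    points = [
        (math.log(size), math.log(count))
        for size, count in histogram.items()
        if size > 0 and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct family sizes")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A power law appears as a line with negative slope on log-log axes.
    return -slope, math.exp(intercept)


# Synthetic toy data (not from any real database): the number of
# families of size s is 256 / s**2, i.e. an exact power law.
toy_sizes = []
for size in (1, 2, 4, 8, 16):
    toy_sizes.extend([size] * (256 // size**2))

alpha, c = fit_power_law(family_size_histogram(toy_sizes))
print(f"estimated exponent alpha = {alpha:.3f}, prefactor c = {c:.1f}")
```

On data where the power law breaks down at small family sizes, as described for the predefined databases, the points for the smallest sizes would fall below the fitted line, so restricting the fit to larger sizes and comparing the two fits is one simple way to see the deviation.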