Protein family clustering for structural genomics.

Author information

Yan Yongpan, Moult John

Affiliation

Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD 20850, USA.

Publication information

J Mol Biol. 2005 Oct 28;353(3):744-59. doi: 10.1016/j.jmb.2005.08.058. Epub 2005 Sep 9.

DOI:10.1016/j.jmb.2005.08.058

PMID:16185712

Abstract

A major goal of structural genomics is the provision of a structural template for a large fraction of protein domains. The magnitude of this task depends on the number and nature of protein sequence families. With a large number of bacterial genomes now fully sequenced, it is possible to obtain improved estimates of the number and diversity of families in that kingdom. We have used an automated clustering procedure to group all sequences in a set of genomes into protein families. Benchmarking shows that the clustering method is sensitive at detecting remote family members and has a low level of false positives. This comprehensive protein family set has been used to address the following questions. (1) What is the structure coverage for currently known families? (2) How will the number of known apparent families grow as more genomes are sequenced? (3) What is a practical strategy for maximizing structure coverage in the future? Our study indicates that approximately 20% of known families with three or more members currently have a representative structure. The study also indicates that the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes have been sequenced. However, the vast majority of these families will be small, and it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families.
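
The closing point of the abstract, that sampling the larger families first covers most domains with relatively few structures, can be illustrated with a minimal sketch. This is not the authors' procedure. It assumes, as a simplification, that every domain belongs to exactly one family and that a single representative structure serves as a template for the whole family. The function name `families_needed` and the toy size distribution are hypothetical and are not data from the paper.

```python
from typing import Sequence


def families_needed(family_sizes: Sequence[int], target_coverage: float) -> int:
    """Count how many families, taken largest first, need one representative
    structure each so that the covered fraction of domains reaches the target.

    Assumption (not from the paper): one structure covers a whole family.
    """
    if not 0.0 < target_coverage <= 1.0:
        raise ValueError("target_coverage must be in (0, 1]")
    total = sum(family_sizes)
    if total == 0:
        return 0
    covered = 0
    # Walk the families from largest to smallest, accumulating covered domains.
    for count, size in enumerate(sorted(family_sizes, reverse=True), start=1):
        covered += size
        if covered / total >= target_coverage:
            return count
    return len(family_sizes)


# Made-up distribution: five large families plus 350 singleton families.
toy_sizes = [500, 300, 200, 100, 50] + [1] * 350
print(families_needed(toy_sizes, 0.75))  # prints 5
```

In this toy case, 5 of 355 families account for 75% of the domains, which is the qualitative effect the abstract describes when most families are small.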
