Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America.
PLoS Biol. 2009 Sep;7(9):e1000205. doi: 10.1371/journal.pbio.1000205. Epub 2009 Sep 29.
The genome projects have unearthed an enormous diversity of genes of unknown function that are still awaiting biological and biochemical characterization. These genes, like most others, can be grouped into families based on sequence similarity. The PFAM database currently contains over 2,200 such families, referred to as domains of unknown function (DUF). In a coordinated effort, the four large-scale centers of the NIH Protein Structure Initiative have determined the first three-dimensional structures for more than 250 of these DUF families. Analysis of the first 248 reveals that about two thirds of the DUF families likely represent very divergent branches of already known and well-characterized families, which allows hypotheses to be formulated about their biological function. The remainder can be formally categorized as new folds, although about one third of these show significant substructure similarity to previously characterized folds. These results imply that, despite the enormous increase in the number and the diversity of new genes being uncovered, the fold space of the proteins they encode is gradually becoming saturated. The previously unexplored sectors of the protein universe appear to be primarily shaped by extreme diversification of known protein families, which then enables organisms to evolve new functions and adapt to particular niches and habitats. Nevertheless, these DUF families still constitute the richest source for discovery of the remaining protein folds and topologies.