Comparing genomes in terms of protein structure: surveys of a finite parts list.

Author information

Gerstein M, Hegyi H

Affiliation

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.

Publication information

FEMS Microbiol Rev. 1998 Oct;22(4):277-304. doi: 10.1111/j.1574-6976.1998.tb00371.x.

DOI:10.1111/j.1574-6976.1998.tb00371.x

PMID:10357579

Abstract

We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in some respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we describe how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being composed of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias.
One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.
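The fold-counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the genome names, the fold assignments, and the function names `venn_regions` and `top_shared` are all hypothetical, chosen only to show how Venn-diagram regions and 'top-10' style counts might be derived once each genome has been matched against a fold library.

```python
from collections import Counter

# Hypothetical input (not data from the review): genome name -> fold labels,
# one label per gene or domain that matched an entry in the fold library.
assignments = {
    "genome_A": ["TIM-barrel", "Rossmann", "Rossmann", "ferredoxin-like"],
    "genome_B": ["Rossmann", "TIM-barrel", "TIM-barrel", "P-loop NTPase"],
    "genome_C": ["Rossmann", "P-loop NTPase", "flavodoxin-like"],
}


def venn_regions(assignments):
    """Group folds by the exact set of genomes in which they occur.

    Each key is a frozenset of genome names (one Venn-diagram region);
    each value is the set of folds found in exactly those genomes.
    """
    fold_sets = {genome: set(folds) for genome, folds in assignments.items()}
    all_folds = set().union(*fold_sets.values())
    regions = {}
    for fold in all_folds:
        key = frozenset(g for g, folds in fold_sets.items() if fold in folds)
        regions.setdefault(key, set()).add(fold)
    return regions


def top_shared(assignments, n=10):
    """Rank folds present in every genome by total occurrences.

    Counting every occurrence (rather than presence/absence) makes the
    ranking sensitive to internal duplication within a genome.
    """
    shared = set.intersection(*(set(folds) for folds in assignments.values()))
    counts = Counter(
        fold
        for folds in assignments.values()
        for fold in folds
        if fold in shared
    )
    return counts.most_common(n)


if __name__ == "__main__":
    for genomes, folds in venn_regions(assignments).items():
        print(sorted(genomes), sorted(folds))
    print(top_shared(assignments))
```

Counting each occurrence versus counting each fold once per genome is the kind of methodological choice the abstract refers to: the first reflects duplication, the second only presence.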
