Gerstein M
Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA.
Fold Des. 1998;3(6):497-512. doi: 10.1016/S1359-0278(98)00066-2.
Determining how representative the known structures are of the proteins encoded by a complete genome is important for assessing to what extent our current picture of protein stability and folding is overly influenced by biases in the structure databank (PDB). It is also important for improving database-based methods of structure prediction and genome annotation.
The known structures are compared to the proteins encoded by eight complete microbial genomes in terms of simple statistics such as sequence length, composition and secondary structure. The known structures are represented by a collection of nonhomologous domains from the PDB and a smaller list of 'biophysical proteins' on which folding experiments have concentrated. The proteins encoded by the genomes are considered as a whole and divided into various regions, such as known-structure homologue, low complexity (nonglobular), transmembrane or linker. Various tests are performed to assess the significance of the reported differences, in both a practical and a statistical sense.
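The simple statistics described above can be illustrated with a minimal sketch. It assumes each data set is just a list of one-letter amino acid sequences; the function names (`composition`, `compare_sets`) and the toy sequences are hypothetical and do not come from the paper, which does not specify its implementation.

```python
from collections import Counter
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seqs):
    """Return the fraction of each standard amino acid, pooled over all sequences."""
    counts = Counter()
    for s in seqs:
        # Ignore anything that is not a standard residue (e.g. X, gaps).
        counts.update(ch for ch in s.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def compare_sets(genome_seqs, pdb_seqs):
    """Difference (genome minus PDB) in mean length and in per-residue composition."""
    g, p = composition(genome_seqs), composition(pdb_seqs)
    return {
        "mean_length_diff": mean(len(s) for s in genome_seqs)
                            - mean(len(s) for s in pdb_seqs),
        "composition_diff": {aa: g[aa] - p[aa] for aa in AMINO_ACIDS},
    }

# Toy sequences, invented purely for illustration.
genome = ["MKKILNQ", "MKIINKQE"]
pdb = ["MCWAG", "ACWGG"]
result = compare_sets(genome, pdb)
```

A positive entry in `composition_diff` means the residue is more frequent in the genome set than in the structure set. Judging whether such a difference is significant would need a separate statistical test, which this sketch does not attempt.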
The proteins encoded by the genomes are significantly different from those in the PDB. Their sequence lengths, which follow an extreme value distribution, are longer than those of the PDB proteins and much longer than those of the biophysical proteins. Their composition differs from that of the PDB proteins in having more Lys, Ile, Asn and Gln and fewer Cys and Trp. This is true overall and especially for the regions corresponding to soluble proteins of as yet unknown fold. Secondary-structure prediction on these uncharacterized regions indicates that they contain on average more helical structure than the PDB; variation about this mean among the genomes is small, with yeast having slightly more sheet structure and Haemophilus influenzae and Helicobacter pylori more helical structure. Further information is available through the GeneCensus system at http://bioinfo.mbb.yale.edu/genome.
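As a rough illustration of describing sequence lengths with an extreme value distribution, the sketch below fits a type I (Gumbel) form by the method of moments. Both the Gumbel form and the fitting method are assumptions made here for illustration; the abstract does not state which extreme value form or fitting procedure was used, and the lengths shown are invented.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel_moments(lengths):
    """Method-of-moments fit of a Gumbel (maximum) distribution.

    Uses mean = mu + gamma * beta and variance = (pi * beta)**2 / 6.
    Returns (mu, beta): location (the mode) and scale.
    """
    if len(lengths) < 2:
        raise ValueError("need at least two lengths")
    beta = stdev(lengths) * math.sqrt(6) / math.pi
    if beta == 0:
        raise ValueError("all lengths are identical; scale is undefined")
    mu = mean(lengths) - EULER_GAMMA * beta
    return mu, beta

def gumbel_pdf(x, mu, beta):
    """Probability density of the Gumbel (maximum) distribution at x."""
    z = (x - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta

# Hypothetical sequence lengths, for illustration only.
lengths = [120, 250, 310, 400, 560, 900]
mu, beta = fit_gumbel_moments(lengths)
```

The fitted `mu` is the mode of the distribution and lies below the mean, which reflects the long right tail typical of protein length distributions. A maximum-likelihood fit would be a reasonable alternative for real data.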