Bork P, Ouzounis C, Sander C, Scharf M, Schneider R, Sonnhammer E
European Molecular Biology Laboratory, Heidelberg, Germany.
Protein Sci. 1992 Dec;1(12):1677-90. doi: 10.1002/pro.5560011216.
With the completion of the first phase of the European yeast genome sequencing project, the complete DNA sequence of chromosome III of Saccharomyces cerevisiae has become available (Oliver, S. G., et al., 1992, Nature 357, 38-46). We have tested the predictive power of computer sequence analysis of the 176 probable protein products of this chromosome, after exclusion of six problem cases. When the results of database similarity searches are pooled with prior knowledge, a likely function can be assigned to 42% of the proteins, and a predicted three-dimensional structure to a third of these (14% of the total). The function of the other 58% is yet to be determined. Of these, about one-third have one or more probable transmembrane segments. Among the most interesting proteins with predicted functions are a new member of the type X polymerase family, a transcription factor with an N-terminal DNA-binding domain related to GAL4, a "fork head" DNA-binding domain previously known only in Drosophila and in mammals, and a putative methyltransferase. Our analysis increased the number of known significant sequence similarities on chromosome III by 13, to a total of 67. Although the nearly 40% success rate of identifying unknown protein function by sequence analysis is surprisingly high, the information gap between known protein sequences and unknown function is expected to widen and become a major bottleneck of genome projects in the near future. Based on the experience gained in this test study, we suggest that the development of an automated computer workbench for protein sequence analysis must be an important item in genome projects.