Rudd K E, Humphery-Smith I, Wasinger V C, Bairoch A
Department of Biochemistry and Molecular Biology, University of Miami School of Medicine, FL 33101-6129, USA.
Electrophoresis. 1998 Apr;19(4):536-44. doi: 10.1002/elps.1150190413.
The EcoGene project involves the examination of Escherichia coli K-12 DNA sequences and accompanying annotation in the public databases in order to refine the representation and prediction of the entire set of E. coli K-12 chromosomally encoded protein sequences. The results of this ongoing effort have been deposited in the SWISSPROT protein sequence database as sequencing of the E. coli genome has progressed to completion in recent years. Through this continuing research, we have discovered that the prediction of low molecular weight (small) proteins, arbitrarily defined as protein sequences ≤ 150 amino acids (aa) in length, is problematic and requires special attention. We describe the small protein subset of EcoGene and the approach used to derive this subset from the complete E. coli genome sequence and database annotations. These E. coli proteins have helped to identify new small genes in other organisms and to identify conserved residues (motifs) using database searches and multiple alignments. Two-thirds of the E. coli small proteins have not been characterized experimentally. The careful application of computer and laboratory methods to the analysis of small proteins is needed for accurate prediction, verification and characterization. The problem of accurate protein sequence identification is not limited to small proteins or to E. coli; these problems are encountered to varying degrees throughout all sequence databases.