Frishman D, Mironov A, Mewes H W, Gelfand M
Munich Information Center for Protein Sequences (MIPS) of the German National Center for Health and Environment (GSF), Am Klopferspitz 18a, 82152 Martinsried, Germany.
Nucleic Acids Res. 1998 Jun 15;26(12):2941-7. doi: 10.1093/nar/26.12.2941.
Analysis of a newly sequenced bacterial genome starts with identification of protein-coding genes. Functional assignment of proteins requires exact knowledge of protein N-termini. We present a new program, ORPHEUS, that identifies candidate genes and accurately predicts gene starts. The analysis starts with a database similarity search and identification of reliable gene fragments. These fragments are used to derive statistical characteristics of protein-coding regions and ribosome-binding sites and to predict the complete set of genes in the analyzed genome. In a test on the Bacillus subtilis and Escherichia coli genomes, the program correctly identified 93.3% and 96.3%, respectively, of experimentally annotated genes longer than 100 codons described in the PIR-International database, and for these genes 96.3% and 83.9% of starts were predicted exactly. Furthermore, 98.9% and 99.1% of genes longer than 100 codons annotated in GenBank were found, and 92.9% and 75.7% of predicted starts coincided with the feature table description. Finally, for the complete gene complements of B. subtilis and E. coli, including genes shorter than 100 codons, gene prediction accuracy was 88.9% and 87.1%, respectively, with 94.2% and 76.7% of starts coinciding with the existing annotation.
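The step in which reliable gene fragments are used to derive statistical characteristics of coding regions can be illustrated with a minimal sketch. It is not the ORPHEUS implementation: the function names (`codon_log_odds`, `coding_score`), the uniform background model, and the pseudocount are all assumptions made for illustration, and the abstract does not specify which statistics the program actually uses.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons over the DNA alphabet.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_log_odds(reliable_fragments, pseudocount=1.0):
    """Log-odds of each codon in trusted in-frame fragments vs. a uniform background.

    Hypothetical helper: the uniform background and the pseudocount are
    illustrative choices, not taken from the paper.
    """
    counts = Counter()
    for frag in reliable_fragments:
        frag = frag.upper()
        for i in range(0, len(frag) - 2, 3):
            codon = frag[i:i + 3]
            if codon in CODON_SET:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(CODONS)
    background = 1.0 / len(CODONS)
    return {
        c: math.log(((counts[c] + pseudocount) / total) / background)
        for c in CODONS
    }


def coding_score(orf, log_odds):
    """Mean per-codon log-odds of a candidate ORF read in frame 0."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    codons = [c for c in codons if c in log_odds]
    if not codons:
        return 0.0
    return sum(log_odds[c] for c in codons) / len(codons)


if __name__ == "__main__":
    # Toy "reliable fragments"; real input would come from similarity hits.
    fragments = ["ATGGCTGCTGCTAAA", "ATGGCTGCTAAAGCT"]
    table = codon_log_odds(fragments)
    print(coding_score("GCTGCTGCT", table) > coding_score("CCCCCCCCC", table))
```

A candidate open reading frame whose codon usage resembles the trusted fragments receives a higher score than one that does not. A fuller treatment would also model ribosome-binding sites upstream of candidate starts, which this sketch omits.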