Audic S, Claverie J M
Structural and Genetic Information Laboratory, Centre National de la Recherche Scientifique-EP.91, 31 rue Joseph Aiguier, Marseille F-13402, France.
Proc Natl Acad Sci U S A. 1998 Aug 18;95(17):10026-31. doi: 10.1073/pnas.95.17.10026.
A new method for predicting protein-coding regions in microbial genomic DNA sequences is presented. It uses an ab initio iterative Markov modeling procedure to automatically partition genomic sequences into three subsets, shown to correspond to coding, coding on the opposite strand, and noncoding segments. In contrast to current methods, such as GENEMARK [Borodovsky, M. & McIninch, J. D. (1993) Comput. Chem. 17, 123-133], neither a training set nor prior knowledge of the statistical properties of the studied genome is required. The method tolerates error rates of 1-2% and can process unassembled sequences. It is thus well suited to the analysis of genome survey and/or fragmented sequence data from uncharacterized microorganisms. The method was validated on 10 complete bacterial genomes (from four major phylogenetic lineages). The results show that protein-coding regions can be identified with an accuracy of up to 90% with a totally automated and objective procedure.
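The abstract gives only the outline of the procedure: an iterative Markov modeling step that partitions the sequence into three subsets without a training set. The Python sketch below illustrates one way such a loop could look; it is not the authors' implementation. The fixed non-overlapping windows, the random initial labels, the add-one smoothing, the default window length and Markov order, and the function names `train_markov`, `log_likelihood`, and `partition` are all assumptions made for illustration. The sketch also uses a single homogeneous Markov model per class and ignores codon phase.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four symbols


def train_markov(windows, order):
    """Estimate order-k transition log-probabilities with add-one smoothing."""
    counts = defaultdict(lambda: {b: 1.0 for b in ALPHABET})
    for w in windows:
        for i in range(order, len(w)):
            counts[w[i - order:i]][w[i]] += 1.0
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        model[ctx] = {b: math.log(c / total) for b, c in nxt.items()}
    return model


def log_likelihood(window, model, order):
    """Score a window under a model; unseen contexts fall back to uniform."""
    uniform = math.log(1.0 / len(ALPHABET))
    return sum(
        model.get(window[i - order:i], {}).get(window[i], uniform)
        for i in range(order, len(window))
    )


def partition(sequence, n_classes=3, window=96, order=2, max_iter=50, seed=0):
    """Iteratively assign windows to the class whose model scores them best.

    Hypothetical illustration: start from random labels, retrain one Markov
    model per class, relabel every window, and stop when labels are stable.
    """
    rng = random.Random(seed)
    windows = [
        sequence[i:i + window]
        for i in range(0, len(sequence) - window + 1, window)
    ]
    labels = [rng.randrange(n_classes) for _ in windows]
    for _ in range(max_iter):
        models = [
            train_markov([w for w, l in zip(windows, labels) if l == c], order)
            for c in range(n_classes)
        ]
        new_labels = [
            max(range(n_classes),
                key=lambda c: log_likelihood(w, models[c], order))
            for w in windows
        ]
        if new_labels == labels:  # converged: no window changed class
            break
        labels = new_labels
    return windows, labels
```

Calling `partition(genome_string)` returns the windows and an integer label per window. In this sketch the labels carry no biological meaning by themselves; deciding which class corresponds to coding, opposite-strand coding, or noncoding sequence would need a separate step not shown here.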