How to interpret an anonymous bacterial genome: machine learning approach to gene identification.

Author information

Hayes W S, Borodovsky M

Affiliation

School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30332-0230, USA.

Publication information

Genome Res. 1998 Nov;8(11):1154-71. doi: 10.1101/gr.8.11.1154.

DOI:10.1101/gr.8.11.1154

PMID:9847079

Abstract

In this report we address the problem of accurate statistical modeling of DNA sequences, either coding or noncoding, for a bacterial species whose genome (or a large portion of it) has been sequenced but not yet characterized experimentally. Availability of these models is critical for successfully solving the genome annotation task by statistical methods of gene finding. We present a method, GeneMark-Genesis, which learns the parameters of Markov models of protein-coding and noncoding regions from anonymous bacterial genomic sequence. These models are subsequently used in the GeneMark and GeneMark.hmm gene-finding programs. Although there is basically one model of a noncoding region for a given genome, several models of protein-coding regions are automatically obtained by GeneMark-Genesis. The diversity of protein-coding models reflects the diversity of oligonucleotide compositions, particularly the diversity of codon usage strategies observed in genes from one and the same genome. In the simplest and most important case, there are just two gene models: a typical one and an atypical one. We show that the atypical model allows one to predict genes that escape identification by the typical model. Many genes predicted by the atypical model appear to be horizontally transferred genes. The early versions of GeneMark-Genesis were used for annotating the genomes of Methanococcus jannaschii and Helicobacter pylori. We report the results of accuracy testing of the full-scale version of GeneMark-Genesis on 10 completely sequenced bacterial genomes. Interestingly, the GeneMark.hmm program that employed the typical and atypical models defined by GeneMark-Genesis was able to predict 683 new atypical genes, 176 of which were confirmed by similarity search.
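The general idea of separate Markov models for coding and noncoding sequence can be illustrated with a small sketch. This is not the GeneMark-Genesis procedure, whose training details the abstract does not give. It is a simplified illustration that assumes a homogeneous fixed-order Markov chain, additive smoothing, and toy training sequences. The names `train_markov` and `log_odds`, the `order` and `pseudocount` parameters, and the example sequences are all hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Assumption: a homogeneous Markov chain with additive smoothing,
    chosen here only to keep the illustration short.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            counts[context][base] += 1.0

    def prob(context, base):
        # Unseen contexts fall back to a uniform distribution via the pseudocount.
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log-likelihood ratios (coding vs. noncoding) over a sequence."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score


# Toy training data (hypothetical, far too small for real use).
coding_model = train_markov(["ATGGCGGCGGCGGCG"])
noncoding_model = train_markov(["ATATATATATATAT"])

# A positive score favors the coding model, a negative one the noncoding model.
print(log_odds("GCGGCGGCG", coding_model, noncoding_model))
print(log_odds("ATATATAT", coding_model, noncoding_model))
```

In the same spirit, having several protein-coding models (for example a typical and an atypical one) would amount to training one such model per gene class and comparing each against the noncoding model. How the gene classes are derived from an anonymous genome is the part the paper itself addresses.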
