Walker Megon, Pavlovic Vladimir, Kasif Simon
Bioinformatics Program, Boston University, Boston, MA 02215, USA.
Nucleic Acids Res. 2002 Jul 15;30(14):3181-91. doi: 10.1093/nar/gkf423.
The ever-growing number of completely sequenced prokaryotic genomes facilitates cross-species comparisons by genomic annotation algorithms. This paper introduces a new probabilistic framework for comparative genomic analysis and demonstrates its utility in improving the accuracy of prokaryotic gene start site detection. Our framework employs a product hidden Markov model (PROD-HMM) whose state architecture models the species-specific trinucleotide frequency patterns in sequences immediately upstream and downstream of a translation start site and detects the contrasting non-synonymous (amino acid changing) and synonymous (silent) substitution rates that differentiate prokaryotic coding from intergenic regions. Depending on the intricacy of the features modeled by the hidden state architecture, intergenic, regulatory, promoter and coding regions can be delimited by this method. The new system is evaluated using a preliminary set of orthologous Pyrococcus gene pairs, for which it demonstrates an improved accuracy of detection. Its robustness is confirmed by analysis with cross-validation of an experimentally verified set of Escherichia coli K-12 and Salmonella typhimurium LT2 orthologs. The novel architecture has a number of attractive features that distinguish it from previous comparative models such as pair-HMMs.
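The synonymous versus non-synonymous contrast mentioned in the abstract can be illustrated with a minimal sketch. This is not the PROD-HMM itself; it only counts the kind of codon-level substitution signal such a model could exploit. It assumes two gap-free, in-frame, equal-length aligned sequences and the standard genetic code, and the names `CODON_TABLE` and `classify_codon_pairs` are hypothetical, chosen for illustration.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES),
        AMINO_ACIDS,
    )
}


def classify_codon_pairs(seq_a: str, seq_b: str) -> Counter:
    """Count identical, synonymous and non-synonymous aligned codon pairs.

    Assumption (not from the paper): both sequences are gap-free, in frame,
    of equal length, and contain only the characters A, C, G and T.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")

    counts = Counter(identical=0, synonymous=0, nonsynonymous=0)
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            counts["identical"] += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            # Different codons, same amino acid: a silent change.
            counts["synonymous"] += 1
        else:
            # Different codons, different amino acids.
            counts["nonsynonymous"] += 1
    return counts


if __name__ == "__main__":
    # Toy example: ATG/ATG, GCT/GCC (both Ala), AAA (Lys) / AGA (Arg).
    print(classify_codon_pairs("ATGGCTAAA", "ATGGCCAGA"))
```

In a coding region one would expect silent changes to be relatively more frequent than amino acid changing ones, whereas an intergenic region read in an arbitrary frame should show no such bias; a comparative model can treat that difference as evidence for where a coding region begins.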