Azad Rajeev K, Borodovsky Mark
School of Biology and School of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA.
Brief Bioinform. 2004 Jun;5(2):118-30. doi: 10.1093/bib/5.2.118.
In this paper, we review developments in probabilistic methods of gene recognition in prokaryotic genomes, with emphasis on connections to the general theory of hidden Markov models (HMMs). We show that the Bayesian method implemented in GeneMark, a frequently used gene-finding tool, can be augmented and reintroduced as a rigorous forward-backward (FB) algorithm for local posterior decoding, as described in HMM theory. Another, earlier developed method, prokaryotic GeneMark.hmm, uses a modification of the Viterbi algorithm for HMMs with duration to identify the most likely global path through hidden functional states given the DNA sequence. The GeneMark and GeneMark.hmm programs are worth using in concert for analysing prokaryotic DNA sequences, which arguably do not follow any exact mathematical model. The new extension of GeneMark using the FB algorithm was implemented in the software program GeneMark.fba. Given a DNA sequence, this program determines the a posteriori probability that each nucleotide belongs to a coding or non-coding region. In addition, it assigns to any open reading frame (ORF) a score defined as a probabilistic measure of all paths through hidden states that traverse the ORF as a coding region. In our tests, the prediction accuracy of GeneMark.fba compared favourably with that of the initial (standard) GeneMark program. Comparison with prokaryotic GeneMark.hmm also demonstrated a certain, though species-specific, degree of improvement in raw gene detection, i.e. detection of the correct reading frame (and stop codon). Exact gene prediction, which requires precise prediction of the gene start (which, in a prokaryotic genome, unambiguously defines the reading frame and stop codon and thus the whole protein product), remains more accurate in GeneMarkS, which uses a more elaborate HMM to specifically address this task.
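As a rough illustration of local posterior decoding with the forward-backward algorithm, the sketch below computes, for each nucleotide, the posterior probability of being in a coding or non-coding state. It is a toy under stated assumptions, not the model used by GeneMark.fba: the two states, the function name `posterior_decode`, and all probabilities in `START`, `TRANS` and `EMIT` are hypothetical values chosen for illustration, and the plain first-order HMM here has no state durations or codon periodicity.

```python
# Toy two-state HMM: "C" = coding, "N" = non-coding.
# All numbers below are made-up illustrative values, not trained parameters.
STATES = ("C", "N")
START = {"C": 0.5, "N": 0.5}
TRANS = {
    "C": {"C": 0.9, "N": 0.1},
    "N": {"C": 0.1, "N": 0.9},
}
EMIT = {
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def posterior_decode(seq, states, start, trans, emit):
    """Return, for each position, a dict of P(state | whole sequence)."""
    n = len(seq)
    if n == 0:
        return []

    # Forward pass, rescaled at every step to avoid numerical underflow.
    fwd, scales = [], []
    prev = None
    for t, x in enumerate(seq):
        cur = {}
        for s in states:
            if t == 0:
                prior = start[s]
            else:
                prior = sum(prev[r] * trans[r][s] for r in states)
            cur[s] = prior * emit[s][x]
        c = sum(cur.values())
        cur = {s: v / c for s, v in cur.items()}
        fwd.append(cur)
        scales.append(c)
        prev = cur

    # Backward pass, reusing the forward scaling factors.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        x_next = seq[t + 1]
        bwd[t] = {
            s: sum(trans[s][r] * emit[r][x_next] * bwd[t + 1][r] for r in states)
            / scales[t + 1]
            for s in states
        }

    # With this scaling, the product of the two passes is the posterior itself.
    return [{s: fwd[t][s] * bwd[t][s] for s in states} for t in range(n)]


if __name__ == "__main__":
    for base, post in zip("ATGGCGCGTAA", posterior_decode("ATGGCGCGTAA", STATES, START, TRANS, EMIT)):
        print(base, round(post["C"], 3))
```

Unlike Viterbi decoding, which returns a single most likely path, these per-position posteriors sum over all paths through the hidden states, which is the sense in which the abstract speaks of a probabilistic measure over paths.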