Krogh A
Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark.
Proc Int Conf Intell Syst Mol Biol. 1997;5:179-86.
A hidden Markov model for gene finding consists of submodels for coding regions, splice sites, introns, intergenic regions and possibly more. We describe how to estimate the model as a whole from labeled sequences instead of estimating the individual parts independently from subsequences. We argue that the standard maximum likelihood estimation criterion is not optimal for training such a model: instead of maximizing the probability of the DNA sequence, one should maximize the probability of the correct prediction given the sequence. This criterion, called conditional maximum likelihood, is used for the gene finder 'HMMgene'. A new (approximate) algorithm is described, which finds the most probable prediction by summing over all paths that yield the same prediction. We show that these methods contribute significantly to the high performance of HMMgene.
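The two ideas in the abstract, the conditional maximum likelihood criterion and the most probable prediction summed over paths, can be illustrated on a toy labeled HMM. The sketch below is not HMMgene and not the approximate algorithm from the paper: the three states, their labels ("n" for non-coding, "c" for coding), and all probabilities are hypothetical, and the most probable labeling is found by brute-force enumeration, which is feasible only for very short sequences. The conditional log-likelihood is computed as log P(sequence, labels) minus log P(sequence), using a forward pass restricted to label-consistent states for the first term.

```python
from collections import defaultdict
from itertools import product
import math

# Hypothetical toy model: three states, two of which share the label "c".
STATES = ["N", "C1", "C2"]
LABEL = {"N": "n", "C1": "c", "C2": "c"}  # "n" = non-coding, "c" = coding
START = {"N": 0.6, "C1": 0.2, "C2": 0.2}
TRANS = {
    "N": {"N": 0.8, "C1": 0.1, "C2": 0.1},
    "C1": {"N": 0.2, "C1": 0.4, "C2": 0.4},
    "C2": {"N": 0.2, "C1": 0.4, "C2": 0.4},
}
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C1": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "C2": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def forward(seq, labels=None):
    """Sum path probabilities with the forward recursion.

    With labels=None this is P(seq). With labels given, only states whose
    label matches labels[t] are allowed at position t, giving P(seq, labels).
    """
    def allowed(state, t):
        return labels is None or LABEL[state] == labels[t]

    f = {s: (START[s] * EMIT[s][seq[0]] if allowed(s, 0) else 0.0)
         for s in STATES}
    for t in range(1, len(seq)):
        f = {s: (sum(f[r] * TRANS[r][s] for r in STATES) * EMIT[s][seq[t]]
                 if allowed(s, t) else 0.0)
             for s in STATES}
    return sum(f.values())


def conditional_log_likelihood(seq, labels):
    """log P(labels | seq), the quantity maximized under the CML criterion."""
    return math.log(forward(seq, labels)) - math.log(forward(seq))


def path_prob(seq, path):
    """Joint probability of one state path and the sequence."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for t in range(1, len(seq)):
        p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][seq[t]]
    return p


def best_path_and_best_labeling(seq):
    """Brute force: single best path vs. best labeling summed over paths."""
    by_labeling = defaultdict(float)
    best_path, best_p = None, -1.0
    for path in product(STATES, repeat=len(seq)):
        p = path_prob(seq, path)
        by_labeling["".join(LABEL[s] for s in path)] += p
        if p > best_p:
            best_path, best_p = path, p
    best_labeling = max(by_labeling, key=by_labeling.get)
    return best_path, best_labeling, dict(by_labeling)


if __name__ == "__main__":
    seq = "ACGGT"
    path, labeling, _ = best_path_and_best_labeling(seq)
    print("labels of single best path:", "".join(LABEL[s] for s in path))
    print("most probable labeling:    ", labeling)
    print("log P(labeling | seq):     ", conditional_log_likelihood(seq, labeling))
```

Because several states carry the same label, the labeling of the single most probable path and the most probable labeling can differ in general; summing over paths is what distinguishes the second quantity from a plain Viterbi decoding.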