Wistrand Markus, Sonnhammer Erik L L
Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden.
J Comput Biol. 2004;11(1):181-93. doi: 10.1089/106652704773416957.
Insertions and deletions in a profile hidden Markov model (HMM) are modeled by transition probabilities between insert, delete and match states. These are estimated by combining observed data and prior probabilities. The transition prior probabilities can be defined either ad hoc or by maximum likelihood (ML) estimation. We show that the choice of transition prior greatly affects the HMM's ability to discriminate between true and false hits. HMM discrimination was measured using the HMMER 2.2 package applied to 373 families from Pfam. We measured the discrimination between true members and noise sequences employing various ML transition priors and also systematically scanned the parameter space of ad hoc transition priors. Our results indicate that ML priors produce far from optimal discrimination, and we present an empirically derived prior that considerably decreases the number of misclassifications compared to ML. Most of the difference stems from the probabilities for exiting a delete state. The ML prior, which is unaware of noise sequences, estimates a delete-to-delete probability that is relatively high and does not penalize noise sequences enough for optimal discrimination.
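The estimation step described above, combining observed transition counts with a prior, can be sketched as a posterior-mean estimate under a single Dirichlet prior. This is a minimal illustration only, not the HMMER 2.2 implementation. The function name `posterior_mean`, the counts, and both sets of pseudocounts are hypothetical values chosen to show how the prior shifts the delete-to-delete probability; none of them come from the paper.

```python
def posterior_mean(counts, alphas):
    """Posterior-mean transition probabilities under a Dirichlet prior.

    counts: observed transition counts out of one state
    alphas: Dirichlet pseudocounts (the prior) for the same transitions
    Returns p_i = (c_i + a_i) / sum_j (c_j + a_j).
    """
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    total = sum(counts) + sum(alphas)
    if total <= 0:
        raise ValueError("counts plus pseudocounts must sum to a positive value")
    return [(c + a) / total for c, a in zip(counts, alphas)]


# Hypothetical observed counts out of a delete state, ordered [D->M, D->D].
delete_counts = [3.0, 1.0]

# Two illustrative priors (assumed values, not taken from the paper):
# one that leaves delete-to-delete relatively likely, one that penalizes it.
prior_high_dd = [1.0, 1.0]
prior_low_dd = [9.0, 1.0]

p_high = posterior_mean(delete_counts, prior_high_dd)
p_low = posterior_mean(delete_counts, prior_low_dd)

print("D->D with permissive prior:", p_high[1])
print("D->D with penalizing prior:", p_low[1])
```

With few observed counts the pseudocounts dominate the estimate, which is consistent with the abstract's point that the choice of transition prior can strongly affect how heavily runs of deletions are penalized.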