Rezaei Vahid, Pezeshk Hamid, Pérez-Sánchez Horacio
Department of Mathematics and Statistics, Faculty of Financial Science, University of Economic Sciences, Tehran, Iran ; School of Computer Science, Institute for Research in Fundamental Science (IPM), Tehran, Iran.
School of Computer Science, Institute for Research in Fundamental Science (IPM), Tehran, Iran ; School of Mathematics, Statistics and Computer Science, University of Tehran, Iran.
PLoS One. 2013 Dec 20;8(12):e80565. doi: 10.1371/journal.pone.0080565. eCollection 2013.
The profile hidden Markov model (PHMM) is widely used to assign protein sequences to their respective families. A major limitation of a PHMM is the assumption that, given the states, the observations (amino acids) are independent. To overcome this limitation, the dependency between amino acids in the multiple sequence alignment (MSA) that represents a PHMM can be added to the PHMM. Because the sequences in an MSA are biologically related, the one-by-one dependency between two amino acids can be considered. In other words, based on the MSA, the dependency between an amino acid and the corresponding amino acid located directly above it in the alignment can be combined with the PHMM. For this purpose, a new emission probability matrix that accounts for the one-by-one dependencies between amino acids is constructed. The parameters of a PHMM are of two types, transition and emission probabilities, which are usually estimated using an EM algorithm called the Baum-Welch algorithm. We have generalized the Baum-Welch algorithm using a similarity emission matrix constructed by integrating the new emission probability matrix with the common emission probability matrix. The performance of the similarity emission is then evaluated by applying it to the top twenty protein families in the Pfam database. We show that using the similarity emission in the Baum-Welch algorithm significantly outperforms the common Baum-Welch algorithm in the task of assigning protein sequences to protein families.
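The idea of combining a per-column emission estimate with the dependency on the residue directly above can be illustrated with a small sketch. The abstract does not state how the two matrices are integrated, so the convex combination below, the `weight` parameter, the pseudocount, the pooling of pairs over all columns, and every function name are assumptions made for illustration only, not the authors' formulation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_emissions(msa, col, pseudocount=1.0):
    """Common emission estimate: smoothed residue frequencies in one MSA column."""
    counts = Counter(row[col] for row in msa if row[col] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def above_conditionals(msa, pseudocount=1.0):
    """Smoothed P(residue | residue directly above it), pooled over all columns.

    Gap characters and any other non-residue symbols are skipped.
    """
    pair_counts = Counter()
    for upper, lower in zip(msa, msa[1:]):
        for a, b in zip(upper, lower):
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                pair_counts[(a, b)] += 1
    cond = {}
    for a in AMINO_ACIDS:
        row_total = sum(pair_counts[(a, b)] for b in AMINO_ACIDS)
        total = row_total + pseudocount * len(AMINO_ACIDS)
        cond[a] = {b: (pair_counts[(a, b)] + pseudocount) / total
                   for b in AMINO_ACIDS}
    return cond


def similarity_emission(common, cond, above, weight=0.5):
    """Hypothetical blend (an assumption): convex combination of both estimates."""
    return {b: (1 - weight) * common[b] + weight * cond[above][b]
            for b in AMINO_ACIDS}


# Toy alignment, invented for this example only.
msa = ["ACDE", "ACDF", "GCDE"]
common = column_emissions(msa, col=0)
cond = above_conditionals(msa)
blended = similarity_emission(common, cond, above="A")
```

Because both inputs are probability distributions over the same twenty residues, the blended values also sum to one for any `weight` between 0 and 1. In a full treatment such a matrix would replace the emission term inside the Baum-Welch updates, which this sketch does not attempt.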