Nánási Michal, Vinař Tomáš, Brejová Broňa
Department of Computer Science, Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia.
Algorithms Mol Biol. 2014 Mar 1;9(1):3. doi: 10.1186/1748-7188-9-3.
Short tandem repeats are ubiquitous in genomic sequences and, due to their complex evolutionary history, pose a challenge for sequence alignment tools.
To better account for tandem repeats in pairwise sequence alignments, we propose a simple, tractable pair hidden Markov model that models them explicitly. Using the framework of gain functions, we design several optimization criteria for decoding this model and describe the resulting decoding algorithms, ranging from the traditional Viterbi and posterior decoding to block-based decoding algorithms tailored to our model. We compare the accuracy of the individual decoding algorithms on simulated and real data and find that our approach is superior to the classical three-state pair HMM.
Our study illustrates the versatility of pair hidden Markov models, coupled with appropriate decoding criteria, as a modeling tool for capturing complex sequence features.
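
For orientation, the classical three-state pair HMM used as the baseline above can be decoded with the Viterbi algorithm roughly as sketched below. This is a minimal illustration, not the model proposed in the paper: it has no repeat states, and the function name `viterbi_pair_hmm` and all parameter values (`delta` for gap opening, `epsilon` for gap extension, `p_same` for the probability of emitting one particular identical pair, `q` for a single base in a gap state) are assumptions chosen only for this example.

```python
import math

NEG_INF = float("-inf")

def viterbi_pair_hmm(x, y, delta=0.1, epsilon=0.3, p_same=0.2, q=0.25):
    """Viterbi decoding for a three-state pair HMM (M = match, X = gap in y,
    Y = gap in x). All parameter values are illustrative assumptions."""
    n, m = len(x), len(y)
    p_diff = (1.0 - 4 * p_same) / 12  # 12 mismatching pairs share the remaining mass
    log = math.log
    # Allowed transitions; there are no direct X<->Y transitions in this sketch.
    trans = {
        ("M", "M"): log(1 - 2 * delta), ("M", "X"): log(delta), ("M", "Y"): log(delta),
        ("X", "X"): log(epsilon), ("X", "M"): log(1 - epsilon),
        ("Y", "Y"): log(epsilon), ("Y", "M"): log(1 - epsilon),
    }
    states = "MXY"
    step = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}  # symbols consumed from x and y
    # V[s][i][j]: best log-probability of aligning x[:i] with y[:j], ending in state s.
    V = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in states}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in states}
    V["M"][0][0] = 0.0  # convention: the alignment starts in M
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for s in states:
                pi, pj = i - step[s][0], j - step[s][1]
                if pi < 0 or pj < 0:
                    continue
                if s == "M":
                    emit = log(p_same if x[pi] == y[pj] else p_diff)
                else:
                    emit = log(q)
                best, arg = NEG_INF, None
                for r in states:
                    t = trans.get((r, s))
                    if t is None or V[r][pi][pj] == NEG_INF:
                        continue
                    cand = V[r][pi][pj] + t
                    if cand > best:
                        best, arg = cand, r
                if arg is not None:
                    V[s][i][j] = best + emit
                    back[s][i][j] = arg
    # Trace back from the best final state to recover the alignment columns.
    s = max(states, key=lambda r: V[r][n][m])
    score = V[s][n][m]
    i, j = n, m
    ax, ay = [], []
    while i > 0 or j > 0:
        prev = back[s][i][j]
        if s == "M":
            i -= 1; j -= 1
            ax.append(x[i]); ay.append(y[j])
        elif s == "X":
            i -= 1
            ax.append(x[i]); ay.append("-")
        else:
            j -= 1
            ax.append("-"); ay.append(y[j])
        s = prev
    return "".join(reversed(ax)), "".join(reversed(ay)), score
```

Calling `viterbi_pair_hmm("ACGTACGT", "ACGACGT")` returns the two gapped rows of the most probable alignment together with its log-probability. Posterior and block-based decoding, as mentioned in the abstract, optimize different gain functions over the same kind of model and are not shown here.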