Wang Jun, Keightley Peter D, Johnson Toby
Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK.
BMC Bioinformatics. 2006 Jun 8;7:292. doi: 10.1186/1471-2105-7-292.
Non-coding DNA sequences comprise a very large proportion of the total genomic content of mammals, most other vertebrates, many invertebrates, and most plants. Unraveling the functional significance of non-coding DNA depends on how well we can align non-coding DNA sequences. However, aligning non-coding DNA sequences is more difficult than aligning protein-coding sequences.
Here we present MCALIGN2, an improved pair hidden Markov model (pair HMM) based method for global pairwise alignment of non-coding DNA sequences. The method uses an explicit, user-specifiable model of the indel length frequency distribution and allows any time-reversible model of nucleotide substitution. It uses a deterministic global optimiser to find the alignment with the highest posterior probability. We test MCALIGN2 in simulations and compare it to a previous Monte Carlo based method (MCALIGN), to the pair HMM method of Knudsen and Miyamoto, and to a heuristic method (AVID) that performed very well in a previous simulation study. We show that the pair HMM methods have excellent performance for all combinations of parameter values we have considered. MCALIGN2 is up to ten times faster than MCALIGN. Given an accurate explicit model, MCALIGN2 resolves indels more accurately than heuristic methods, but it is computationally slower.
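To make the pair HMM idea concrete, the following is a minimal sketch of a Viterbi search for the most probable global alignment under a toy three-state pair HMM (match, gap in the second sequence, gap in the first). It is not the MCALIGN2 implementation. The function name `viterbi_pair_hmm` and the parameters `delta`, `epsilon`, and `p_match` are hypothetical, gap lengths are geometric rather than drawn from an explicit indel length distribution, and the substitution model is a simple uniform-frequency one.

```python
import math

NEG_INF = float("-inf")

def viterbi_pair_hmm(x, y, delta=0.05, epsilon=0.4, p_match=0.9):
    """Most probable global alignment under a toy 3-state pair HMM.

    All parameters are illustrative assumptions, not values from the paper:
    delta   : probability of opening a gap from the match state
    epsilon : probability of extending a gap (gives geometric gap lengths)
    p_match : probability that an aligned pair of bases is identical
    """
    n, m = len(x), len(y)

    def log_pair(a, b):
        # Joint emission probability of an aligned pair (4 identical, 12 differing pairs).
        return math.log(p_match / 4 if a == b else (1 - p_match) / 12)

    log_gap = math.log(0.25)  # a base emitted against a gap, uniform frequencies
    t = {  # log transition probabilities; X and Y never follow each other directly
        ("M", "M"): math.log(1 - 2 * delta),
        ("M", "X"): math.log(delta), ("M", "Y"): math.log(delta),
        ("X", "X"): math.log(epsilon), ("X", "M"): math.log(1 - epsilon),
        ("Y", "Y"): math.log(epsilon), ("Y", "M"): math.log(1 - epsilon),
    }
    states = ("M", "X", "Y")
    V = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in states}
    ptr = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in states}
    V["M"][0][0] = 0.0  # the start is treated as the match state

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # M: align x[i-1] with y[j-1]
                p = max(states, key=lambda s: V[s][i - 1][j - 1] + t[(s, "M")])
                V["M"][i][j] = V[p][i - 1][j - 1] + t[(p, "M")] + log_pair(x[i - 1], y[j - 1])
                ptr["M"][i][j] = p
            if i > 0:  # X: x[i-1] aligned to a gap
                p = max(("M", "X"), key=lambda s: V[s][i - 1][j] + t[(s, "X")])
                V["X"][i][j] = V[p][i - 1][j] + t[(p, "X")] + log_gap
                ptr["X"][i][j] = p
            if j > 0:  # Y: y[j-1] aligned to a gap
                p = max(("M", "Y"), key=lambda s: V[s][i][j - 1] + t[(s, "Y")])
                V["Y"][i][j] = V[p][i][j - 1] + t[(p, "Y")] + log_gap
                ptr["Y"][i][j] = p

    # Trace back from the best final state to recover the alignment.
    state = max(states, key=lambda s: V[s][n][m])
    score = V[state][n][m]
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        prev = ptr[state][i][j]
        if state == "M":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif state == "X":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
        state = prev
    return "".join(reversed(ax)), "".join(reversed(ay)), score


aligned_x, aligned_y, log_prob = viterbi_pair_hmm("ACGTACGT", "ACGTCGT")
print(aligned_x)
print(aligned_y)
```

Replacing the geometric gap states with an explicit, user-specified length distribution, as the abstract describes, would require a different state structure or recursion; this sketch only shows the general shape of the dynamic programme.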
MCALIGN2 produces better quality alignments by explicitly using biological knowledge about the indel length distribution and time-reversible models of nucleotide substitution. As a result, for the cases we have considered, it can outperform other available methods for aligning non-coding DNA sequences.
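As an illustration of what "time-reversible" means for a substitution model, the sketch below builds an HKY-style rate matrix and checks the detailed balance condition $\pi_i q_{ij} = \pi_j q_{ji}$. The choice of HKY, the function names, and the example frequencies are assumptions made for illustration; the abstract only states that any time-reversible model may be used.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def hky_rate_matrix(pi, kappa):
    """Unnormalised HKY-style rate matrix as nested dicts q[from][to].

    pi    : dict mapping each base to its equilibrium frequency (assumed values)
    kappa : transition/transversion rate ratio (assumed value)
    """
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a == b:
                continue
            # A<->G and C<->T are transitions; all other changes are transversions.
            is_transition = (a in PURINES) == (b in PURINES)
            q[a][b] = pi[b] * (kappa if is_transition else 1.0)
        # Diagonal entries make each row sum to zero.
        q[a][a] = -sum(q[a][b] for b in BASES if b != a)
    return q

def is_time_reversible(q, pi, tol=1e-12):
    """Check detailed balance: pi[a] * q[a][b] == pi[b] * q[b][a] for all pairs."""
    return all(
        abs(pi[a] * q[a][b] - pi[b] * q[b][a]) <= tol
        for a in BASES
        for b in BASES
    )


pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = hky_rate_matrix(pi, kappa=2.0)
print(is_time_reversible(q, pi))
```

Any rate matrix that passes this check with its own equilibrium frequencies satisfies the reversibility property the abstract refers to.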