Bayesian basecalling for DNA sequence analysis using hidden Markov models.

Author information

Liang Kuo-Ching, Wang Xiaodong, Anastassiou Dimitris

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2007 Jul-Sep;4(3):430-440. doi: 10.1109/tcbb.2007.1027.

Abstract

It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models (HMMs). Basecalling, the procedure that determines the sequence of bases from a given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach that employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode prior biological knowledge into the basecalling algorithm, but also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers.
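
As a rough illustration of the Viterbi decoding step mentioned above, the following minimal sketch decodes a toy discrete HMM. It is not the model from the paper: the four hidden states, the symbolic observations (one dominant dye channel per peak), and all probabilities are assumptions chosen only to make the example runnable, and the function name viterbi is hypothetical.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for a discrete HMM.

    Assumes a non-empty observation list and strictly positive
    probabilities, so every log-probability is finite.
    """
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t - 1][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model (all values are illustrative assumptions, not estimates from data)
STATES = ["A", "C", "G", "T"]
SYMBOLS = ["a", "c", "g", "t"]  # dominant dye channel observed at each peak
log_start = {s: math.log(0.25) for s in STATES}
log_trans = {p: {s: math.log(0.25) for s in STATES} for p in STATES}
log_emit = {
    s: {o: math.log(0.85 if o == s.lower() else 0.05) for o in SYMBOLS}
    for s in STATES
}

called = viterbi(["a", "c", "g", "g", "t"], STATES, log_start, log_trans, log_emit)
print("".join(called))  # prints ACGGT
```

Working in log space avoids numerical underflow on long traces. In the setting the abstract describes, the HMM parameters would be estimated rather than fixed by hand as they are here.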
