Arribas-Gil Ana, Metzler Dirk, Plouhinec Jean-Louis
Departamento de Estadística, Universidad Carlos III de Madrid, Getafe, Spain.
IEEE/ACM Trans Comput Biol Bioinform. 2009 Apr-Jun;6(2):281-95. doi: 10.1109/TCBB.2007.70246.
We present a stochastic sequence evolution model for aligning two homologous sequences and estimating the mutation rates between them. To identify conserved regions and account for heterogeneity along a DNA sequence, the model allows two possible evolutionary behaviors: the sequence is divided into slow-evolving and fast-evolving regions. The boundaries between these regions are unknown, and our aim is to detect them. The evolution model combines a fragment insertion and deletion process that acts on fast regions only with a substitution process that acts on both fast and slow regions at different rates. This model induces a pair hidden Markov structure at the level of alignments, which makes efficient statistical alignment algorithms possible. We propose two complementary estimation methods: a Gibbs sampler for Bayesian estimation and a stochastic version of the EM algorithm for maximum likelihood estimation. Both algorithms involve sampling alignments. We propose a partial alignment sampler, which is computationally less expensive than the typical whole alignment sampler, and we show that both estimation algorithms converge when used with it. Our algorithms provide consistent estimates of the mutation rates, as well as plausible alignments and sequence segmentations, on both simulated and real data.
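To make the slow/fast segmentation idea concrete, here is a minimal Python sketch. It is not the authors' pair hidden Markov model: it ignores insertions and deletions entirely and assumes an ungapped, equal-length pair of sequences. It further assumes Jukes-Cantor substitutions and a plain two-state HMM decoded with the Viterbi algorithm instead of the Gibbs or stochastic EM procedures described in the abstract. The function names and the parameters `slow_rate`, `fast_rate`, and `stay` are hypothetical, with values chosen only for illustration.

```python
import math


def jc_mismatch_prob(rate, t=1.0):
    """Jukes-Cantor probability that two aligned sites differ after time t."""
    return 0.75 * (1.0 - math.exp(-4.0 * rate * t / 3.0))


def segment_slow_fast(seq_a, seq_b, slow_rate=0.05, fast_rate=1.0, stay=0.95):
    """Label each column of an ungapped pair as 'S' (slow) or 'F' (fast).

    Simplified illustration only: a two-state HMM over match/mismatch
    columns, decoded with Viterbi. All parameter values are assumptions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch handles ungapped, equal-length pairs only")
    if not seq_a:
        return ""

    states = "SF"
    p_diff = {"S": jc_mismatch_prob(slow_rate), "F": jc_mismatch_prob(fast_rate)}
    log_trans = {
        (u, v): math.log(stay if u == v else 1.0 - stay)
        for u in states
        for v in states
    }

    def log_emit(state, a, b):
        # Emission depends only on whether the two aligned bases differ.
        p = p_diff[state]
        return math.log(p) if a != b else math.log(1.0 - p)

    # Uniform prior over the regime of the first column.
    score = {s: math.log(0.5) + log_emit(s, seq_a[0], seq_b[0]) for s in states}
    back = []
    for a, b in zip(seq_a[1:], seq_b[1:]):
        new_score, pointers = {}, {}
        for v in states:
            best = max(states, key=lambda u: score[u] + log_trans[(u, v)])
            pointers[v] = best
            new_score[v] = score[best] + log_trans[(best, v)] + log_emit(v, a, b)
        back.append(pointers)
        score = new_score

    # Trace back the most probable regime path.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling `segment_slow_fast(seq_a, seq_b)` returns one label per column. Extending the sketch toward the model in the abstract would mean adding gap states that are reachable only in the fast regime and estimating the rates rather than fixing them.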