Department of Information Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
Nucleic Acids Res. 2012 Oct;40(19):e147. doi: 10.1093/nar/gks644. Epub 2012 Jun 29.
Tandem repeats occur frequently in biological sequences and are important for studying genome evolution and human disease. A number of methods have been designed to detect a single tandem repeat in a sliding window. In this article, we focus on the case where an unknown number of tandem repeat segments sharing the same pattern are dispersed throughout a sequence. We construct a probabilistic generative model for the tandem repeats, in which the sequence pattern is represented by a motif matrix, and adopt a Bayesian approach to inference in this model. Markov chain Monte Carlo (MCMC) algorithms are used to explore the posterior distribution in order to infer both the motif matrix of the tandem repeats and the locations of the repeat segments. Reversible jump Markov chain Monte Carlo (RJMCMC) algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments. Experiments on both synthetic and real data show that this approach is effective at detecting dispersed short tandem repeats. To our knowledge, this is the first work to adopt RJMCMC algorithms for the detection of tandem repeats.
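To make the idea of a generative model with a motif matrix concrete, the following minimal sketch samples a synthetic sequence containing dispersed tandem repeat segments. It is not the authors' implementation. The function names, the uniform background distribution, and the fixed number of copies per segment are all assumptions made for illustration; in the abstract, the number of segments is unknown and is inferred rather than given.

```python
import random

ALPHABET = "ACGT"


def sample_copy(motif, rng):
    """Draw one copy of the pattern: one nucleotide per motif column.

    `motif` is a list of columns, each holding four weights for A, C, G, T.
    """
    return "".join(rng.choices(ALPHABET, weights=col)[0] for col in motif)


def generate_sequence(motif, length, n_segments, copies_per_segment, rng):
    """Sample a background sequence with dispersed tandem repeat segments.

    Returns the sequence and the sorted start positions of the segments.
    All parameters are hypothetical choices for this sketch.
    """
    seg_len = len(motif) * copies_per_segment
    free = length - n_segments * seg_len
    if free < 0:
        raise ValueError("sequence too short for the requested segments")

    # Background: i.i.d. uniform nucleotides (an assumption of this sketch).
    seq = rng.choices(ALPHABET, k=length)

    # Pick non-overlapping starts: sort random offsets into the free space,
    # then shift the i-th offset past the i segments placed before it.
    offsets = sorted(rng.randint(0, free) for _ in range(n_segments))
    starts = [off + i * seg_len for i, off in enumerate(offsets)]

    # Each segment is a run of consecutive copies drawn from the same motif.
    for s in starts:
        copies = "".join(sample_copy(motif, rng)
                         for _ in range(copies_per_segment))
        seq[s:s + seg_len] = copies
    return "".join(seq), starts
```

For example, `generate_sequence(motif, 60, 2, 3, random.Random(0))` with a three-column motif yields a 60-nucleotide sequence holding two segments of three copies each. Inference in the setting described above runs in the opposite direction: given only the sequence, recover the motif matrix and the segment locations.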