Department of Statistics, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong.
Bioinformatics. 2011 Jul 1;27(13):1772-9. doi: 10.1093/bioinformatics/btr287. Epub 2011 May 6.
Repeat detection problems are traditionally formulated as string matching or signal processing problems. Methods based on these formulations cannot readily handle gaps between repeat units and cannot detect repeat patterns shared by multiple sequences. This study detects short adjacent repeats with interunit insertions from multiple sequences. For biological sequences, such studies can shed light on molecular structure, biological function and evolution.
The task of detecting short adjacent repeats is formulated as a statistical inference problem using a probabilistic generative model. A Markov chain Monte Carlo algorithm is proposed to infer the parameters in a de novo fashion. Applications to synthetic and real biological data show that the new method not only has a competitive edge over existing methods, but also provides a way to study the structure and evolution of repeat-containing genes.
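To make the idea of a generative model for adjacent repeats concrete, the following is a minimal toy sketch in Python. It is not the authors' model or their C++ implementation: the function names, the position weight matrix `pwm`, the geometric gap distribution controlled by `p_insert`, and the uniform background are all assumptions chosen for illustration. It only shows how a sequence containing adjacent repeat units separated by optional insertions might be simulated; an inference procedure such as MCMC would then try to recover the unit pattern and positions from such sequences.

```python
import random

ALPHABET = "ACGT"


def sample_unit(pwm, rng):
    """Draw one repeat unit, one letter per column of the weight matrix."""
    return "".join(rng.choices(ALPHABET, weights=col)[0] for col in pwm)


def generate_sequence(pwm, n_units, flank_len, p_insert, rng):
    """Toy generative process (illustrative assumption, not the paper's model):
    background flank, adjacent units with optional interunit insertions,
    then another background flank. Requires 0 <= p_insert < 1."""

    def background(k):
        # Uniform background letters (an assumption for simplicity).
        return "".join(rng.choice(ALPHABET) for _ in range(k))

    parts = [background(flank_len)]
    unit_starts = []
    pos = flank_len
    for i in range(n_units):
        if i > 0:
            # Geometric number of inserted letters between consecutive units.
            gap = 0
            while rng.random() < p_insert:
                gap += 1
            parts.append(background(gap))
            pos += gap
        unit_starts.append(pos)
        parts.append(sample_unit(pwm, rng))
        pos += len(pwm)
    parts.append(background(flank_len))
    return "".join(parts), unit_starts


# Hypothetical 4-column pattern that always emits "ACGT".
exact_pwm = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
seq, starts = generate_sequence(exact_pwm, n_units=3, flank_len=5,
                                p_insert=0.3, rng=random.Random(1))
```

Here `starts` records where each simulated unit begins, so the gaps between units can be read off as `starts[i + 1] - starts[i] - len(exact_pwm)`. Using less concentrated columns in `pwm` would produce imperfect copies of the unit, which is the harder case that a statistical formulation is meant to address.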
The related C++ source code and datasets are available at http://ihome.cuhk.edu.hk/%7Eb118998/share/BASARD.zip.