Bi Chengpeng
Bioinformatics and Intelligent Computing Lab, Division of Clinical Pharmacology, Children's Mercy Hospitals, Kansas City, Missouri, USA.
BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S13. doi: 10.1186/1471-2105-10-S1-S13.
Deciphering cis-regulatory elements or de novo motif-finding in genomes still remains elusive although much algorithmic effort has been expended. The Markov chain Monte Carlo (MCMC) method such as Gibbs motif samplers has been widely employed to solve the de novo motif-finding problem through sequence local alignment. Nonetheless, the MCMC-based motif samplers still suffer from local maxima like EM. Therefore, as a prerequisite for finding good local alignments, these motif algorithms are often independently run a multitude of times, but without information exchange between different chains. Hence it would be worth a new algorithm design enabling such information exchange.
This paper presents a novel motif-finding algorithm by evolving a population of Markov chains with information exchange (PMC), each of which is initialized as a random alignment and run by the Metropolis-Hastings sampler (MHS). It is progressively updated through a series of local alignments stochastically sampled. Explicitly, the PMC motif algorithm performs stochastic sampling as specified by a population-based proposal distribution rather than individual ones, and adaptively evolves the population as a whole towards a global maximum. The alignment information exchange is accomplished by taking advantage of the pooled motif site distributions. A distinct method for running multiple independent Markov chains (IMC) without information exchange, or dubbed as the IMC motif algorithm, is also devised to compare with its PMC counterpart.
Experimental studies demonstrate that the performance could be improved if pooled information were used to run a population of motif samplers. The new PMC algorithm was able to improve the convergence and outperformed other popular algorithms tested using simulated and biological motif sequences.
尽管已经投入了大量的算法研究,但在基因组中解析顺式调控元件或从头基序发现仍然难以捉摸。马尔可夫链蒙特卡罗(MCMC)方法,如吉布斯基序采样器,已被广泛用于通过序列局部比对来解决从头基序发现问题。然而,基于MCMC的基序采样器仍然像期望最大化(EM)方法一样受到局部最大值的困扰。因此,作为找到良好局部比对的先决条件,这些基序算法通常独立运行多次,但不同链之间没有信息交换。因此,设计一种能够实现这种信息交换的新算法是值得的。
本文提出了一种新颖的基序发现算法,即通过具有信息交换的马尔可夫链群体进化(PMC)算法,其中每个链初始化为随机比对并由梅特罗波利斯-黑斯廷斯采样器(MHS)运行。它通过一系列随机采样的局部比对逐步更新。具体而言,PMC基序算法按照基于群体的提议分布而不是个体提议分布进行随机采样,并使整个群体朝着全局最大值自适应进化。比对信息交换是通过利用汇总的基序位点分布来完成的。还设计了一种用于运行多个无信息交换的独立马尔可夫链(IMC)的独特方法,即IMC基序算法,以与其PMC对应算法进行比较。
实验研究表明,如果使用汇总信息来运行一组基序采样器,性能可以得到提高。新的PMC算法能够提高收敛性,并且在使用模拟和生物基序序列进行测试时优于其他流行算法。