Bi Chengpeng
Bioinformatics and Intelligent Computing Laboratory, Division of Clinical Pharmacology, Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Kansas City, MO 64108, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2009 Jul-Sep;6(3):370-86. doi: 10.1109/TCBB.2008.103.
Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named as Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase shifts and multiple modal issues in motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word enumerative motif algorithms tested using simulated (l, d)-motif cases, documented prokaryotic, and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits its excellent capacity while compared with other multiple sequence alignment methods.
基序发现方法在解读基因组中的遗传调控密码(即基序)以及定位蛋白质序列中的保守结构域方面发挥着关键作用。期望最大化(EM)算法是从头基序发现中最常用的方法之一。基于位置权重矩阵(PWM)更新技术,本文提出了一种EM基序查找算法的蒙特卡罗版本,该算法在局部比对空间中进行随机抽样,以克服传统EM算法被困于局部最优的主要缺点。新实现的算法被命名为蒙特卡罗EM基序发现算法(MCEMDA)。MCEMDA从初始模型开始,然后迭代执行蒙特卡罗模拟和参数更新,直至收敛。引入对数似然分析技术和前k策略来处理基序发现问题中的相位偏移和多模态问题。设计了一种新颖的分组基序比对(GMA)算法,通过对一组候选局部比对进行聚类来选择基序,并成功应用于细微基序发现。在使用模拟的(l, d)基序案例、已记录的原核生物和真核生物DNA基序序列进行测试时,MCEMDA与其他流行的基于PWM和词枚举的基序算法相比表现出色。最后,MCEMDA应用于使用蛋白质基准检测保守结构域的大片段,并与其他多序列比对方法相比展现出其卓越的能力。