A Monte Carlo EM algorithm for de novo motif discovery in biomolecular sequences.

Author information

Bi Chengpeng

Affiliation

Bioinformatics and Intelligent Computing Laboratory, Division of Clinical Pharmacology, Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Kansas City, MO 64108, USA.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2009 Jul-Sep;6(3):370-86. doi: 10.1109/TCBB.2008.103.

DOI:10.1109/TCBB.2008.103

PMID:19644166

Abstract

Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model and then iteratively performs Monte Carlo simulation and parameter updates until convergence. A log-likelihood profiling technique, together with a top-k strategy, is introduced to cope with the phase-shift and multimodality issues in the motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and is successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word-enumerative motif algorithms when tested on simulated (l, d)-motif cases and on documented prokaryotic and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and shows excellent capacity when compared with other multiple sequence alignment methods.
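The iteration described in the abstract (start from an initial model, then alternate Monte Carlo simulation with a PWM update) can be illustrated with a minimal sketch. This is not the published MCEMDA implementation: it assumes exactly one motif site per sequence, a uniform DNA background, and a fixed iteration count in place of a convergence test, and it omits the log-likelihood profiling, top-k, and GMA steps. The names `build_pwm`, `site_log_ratio`, and `mcem_motif` are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def build_pwm(seqs, starts, width, pseudo=0.5):
    """Estimate a PWM from one site per sequence, with pseudocounts."""
    pwm = []
    for j in range(width):
        counts = {a: pseudo for a in ALPHABET}
        for s, p in zip(seqs, starts):
            counts[s[p + j]] += 1
        total = sum(counts.values())
        pwm.append({a: counts[a] / total for a in ALPHABET})
    return pwm

def site_log_ratio(pwm, seq, pos, bg=0.25):
    """Log-likelihood ratio of the site at pos: PWM vs. uniform background (assumed)."""
    return sum(math.log(col[seq[pos + j]] / bg) for j, col in enumerate(pwm))

def mcem_motif(seqs, width, iters=200, seed=0):
    """Toy Monte Carlo EM: sample site positions, then re-estimate the PWM."""
    rng = random.Random(seed)
    # Initial model: a PWM built from randomly chosen start positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    pwm = build_pwm(seqs, starts, width)
    best_score, best_starts = -math.inf, list(starts)
    for _ in range(iters):
        # Monte Carlo E-step: draw one site per sequence with probability
        # proportional to its likelihood ratio under the current PWM.
        starts = []
        for s in seqs:
            n = len(s) - width + 1
            logw = [site_log_ratio(pwm, s, p) for p in range(n)]
            m = max(logw)  # subtract the max for numerical stability
            weights = [math.exp(x - m) for x in logw]
            starts.append(rng.choices(range(n), weights=weights)[0])
        # M-step: update the PWM from the sampled local alignment.
        pwm = build_pwm(seqs, starts, width)
        score = sum(site_log_ratio(pwm, s, p) for s, p in zip(seqs, starts))
        if score > best_score:
            best_score, best_starts = score, list(starts)
    return best_starts, best_score
```

Because the sites are sampled rather than chosen deterministically, the chain can leave a poor alignment that a plain EM update would keep refining; the sketch simply remembers the best-scoring alignment it visits.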

Similar articles

Memetic algorithms for de novo motif-finding in biomedical sequences.

Artif Intell Med. 2012 Sep;56(1):1-17. doi: 10.1016/j.artmed.2012.04.002. Epub 2012 May 20.

A profile-based deterministic sequential Monte Carlo algorithm for motif discovery.

Bioinformatics. 2008 Jan 1;24(1):46-55. doi: 10.1093/bioinformatics/btm543. Epub 2007 Nov 17.

On the Monte-Carlo expectation maximization for finding motifs in DNA sequences.

IEEE J Biomed Health Inform. 2015 Mar;19(2):677-86. doi: 10.1109/JBHI.2014.2322694. Epub 2014 May 8.

SEAM: a Stochastic EM-type Algorithm for Motif-finding in biopolymer sequences.

J Bioinform Comput Biol. 2007 Feb;5(1):47-77. doi: 10.1142/s0219720007002527.

DNA motif alignment by evolving a population of Markov chains.

BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S13. doi: 10.1186/1471-2105-10-S1-S13.

HIGEDA: a hierarchical gene-set genetics based algorithm for finding subtle motifs in biological sequences.

Bioinformatics. 2010 Feb 1;26(3):302-9. doi: 10.1093/bioinformatics/btp676. Epub 2009 Dec 8.

Data augmentation algorithms for detecting conserved domains in protein sequences: a comparative study.

J Proteome Res. 2008 Jan;7(1):192-201. doi: 10.1021/pr070475q. Epub 2007 Dec 15.

A sequential Monte Carlo EM approach to the transcription factor binding site identification problem.

Bioinformatics. 2007 Jun 1;23(11):1313-20. doi: 10.1093/bioinformatics/btm054. Epub 2007 Mar 25.

Bayesian models and Markov chain Monte Carlo methods for protein motifs with the secondary characteristics.

J Comput Biol. 2005 Sep;12(7):952-70. doi: 10.1089/cmb.2005.12.952.

Cited by

A Review on Planted (l, d) Motif Discovery Algorithms for Medical Diagnose.

Sensors (Basel). 2022 Feb 5;22(3):1204. doi: 10.3390/s22031204.

Review of Different Sequence Motif Finding Algorithms.

Avicenna J Med Biotechnol. 2019 Apr-Jun;11(2):130-148.

An Affinity Propagation-Based DNA Motif Discovery Algorithm.

Biomed Res Int. 2015;2015:853461. doi: 10.1155/2015/853461. Epub 2015 Aug 10.

Stochastic EM-based TFBS motif discovery with MITSU.

Bioinformatics. 2014 Jun 15;30(12):i310-8. doi: 10.1093/bioinformatics/btu286.

MCOIN: a novel heuristic for determining transcription factor binding site motif width.

Algorithms Mol Biol. 2013 Jun 27;8(1):16. doi: 10.1186/1748-7188-8-16.

PairMotif: A new pattern-driven algorithm for planted (l, d) DNA motif search.

PLoS One. 2012;7(10):e48442. doi: 10.1371/journal.pone.0048442. Epub 2012 Oct 31.
