STEME: efficient EM to find motifs in large data sets.

Affiliation

MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Robinson Way, Cambridge CB2 0SR, UK.

Publication information

Nucleic Acids Res. 2011 Oct;39(18):e126. doi: 10.1093/nar/gkr574. Epub 2011 Jul 23.

DOI: 10.1093/nar/gkr574

PMID: 21785132

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3185442/

Abstract

MEME and many other popular motif finders use the expectation-maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.
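As a rough illustration of why indexing the input can help, the sketch below computes an EM E-step for a simple two-component mixture (motif versus background) in two ways: once per W-mer occurrence, and once per distinct W-mer. It is not STEME's algorithm. It uses a plain dictionary of W-mer counts in place of a suffix tree, it is exact rather than approximate, and the function names, the mixture model, and the `prior` parameter are assumptions made for this example only.

```python
from collections import Counter


def wmer_likelihood_ratio(wmer, pwm, background):
    """Likelihood of wmer under the motif model divided by its background likelihood."""
    ratio = 1.0
    for pos, base in enumerate(wmer):
        ratio *= pwm[pos][base] / background[base]
    return ratio


def motif_posterior(ratio, prior):
    """Posterior probability that a W-mer is a motif site, given its likelihood ratio."""
    return prior * ratio / (prior * ratio + 1.0 - prior)


def e_step_naive(sequences, pwm, background, prior):
    """Score every W-mer occurrence separately: work grows with total input length."""
    width = len(pwm)
    posteriors = []
    for seq in sequences:
        for start in range(len(seq) - width + 1):
            ratio = wmer_likelihood_ratio(seq[start:start + width], pwm, background)
            posteriors.append(motif_posterior(ratio, prior))
    return posteriors


def e_step_grouped(sequences, pwm, background, prior):
    """Score each distinct W-mer once and keep its occurrence count.

    Returns a dict mapping W-mer -> (count, posterior).
    """
    width = len(pwm)
    counts = Counter(
        seq[start:start + width]
        for seq in sequences
        for start in range(len(seq) - width + 1)
    )
    return {
        wmer: (n, motif_posterior(wmer_likelihood_ratio(wmer, pwm, background), prior))
        for wmer, n in counts.items()
    }
```

Both functions yield the same expected number of motif sites (the sum of posteriors, weighted by count in the grouped case), but the grouped version evaluates the model only once per distinct W-mer. A suffix tree extends this idea by also sharing work between W-mers with common prefixes.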

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/013f/3185442/bc31a084a79d/gkr574f1.jpg

Similar articles

STEME: efficient EM to find motifs in large data sets.

Nucleic Acids Res. 2011 Oct;39(18):e126. doi: 10.1093/nar/gkr574. Epub 2011 Jul 23.

STEME: a robust, accurate motif finder for large data sets.

PLoS One. 2014 Mar 13;9(3):e90735. doi: 10.1371/journal.pone.0090735. eCollection 2014.

Sequential Integration of Fuzzy Clustering and Expectation Maximization for Transcription Factor Binding Site Identification.

J Comput Biol. 2018 Nov;25(11):1247-1256. doi: 10.1089/cmb.2017.0230. Epub 2018 Aug 22.

EXTREME: an online EM algorithm for motif discovery.

Bioinformatics. 2014 Jun 15;30(12):1667-73. doi: 10.1093/bioinformatics/btu093. Epub 2014 Feb 14.

Stochastic EM-based TFBS motif discovery with MITSU.

Bioinformatics. 2014 Jun 15;30(12):i310-8. doi: 10.1093/bioinformatics/btu286.

HIGEDA: a hierarchical gene-set genetics based algorithm for finding subtle motifs in biological sequences.

Bioinformatics. 2010 Feb 1;26(3):302-9. doi: 10.1093/bioinformatics/btp676. Epub 2009 Dec 8.

The value of position-specific priors in motif discovery using MEME.

BMC Bioinformatics. 2010 Apr 9;11:179. doi: 10.1186/1471-2105-11-179.

A fast weak motif-finding algorithm based on community detection in graphs.

BMC Bioinformatics. 2013 Jul 17;14:227. doi: 10.1186/1471-2105-14-227.

Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm.

J Comput Biol. 2013 Mar;20(3):237-48. doi: 10.1089/cmb.2012.0233.

MEME: discovering and analyzing DNA and protein sequence motifs.

Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W369-73. doi: 10.1093/nar/gkl198.

Cited by

TFBSFootprinter: a multiomics tool for prediction of transcription factor binding sites in vertebrate species.

Transcription. 2025 Apr-Jun;16(2-3):204-223. doi: 10.1080/21541264.2025.2521764. Epub 2025 Jul 11.

MicrosatNavigator: exploring nonrandom distribution and lineage-specificity of microsatellite repeat motifs on vertebrate sex chromosomes across 186 whole genomes.

Chromosome Res. 2023 Sep 30;31(4):29. doi: 10.1007/s10577-023-09738-4.

A survey on algorithms to characterize transcription factor binding sites.

Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad156.

STREME: accurate and versatile sequence motif discovery.

Bioinformatics. 2021 Sep 29;37(18):2834-2840. doi: 10.1093/bioinformatics/btab203.

A noncanonical AR addiction drives enzalutamide resistance in prostate cancer.

Nat Commun. 2021 Mar 9;12(1):1521. doi: 10.1038/s41467-021-21860-7.

A Clustering Approach for Motif Discovery in ChIP-Seq Dataset.

Entropy (Basel). 2019 Aug 16;21(8):802. doi: 10.3390/e21080802.

ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery.

Bioinformatics. 2019 Nov 1;35(22):4632-4639. doi: 10.1093/bioinformatics/btz290.

Review of Different Sequence Motif Finding Algorithms.

Avicenna J Med Biotechnol. 2019 Apr-Jun;11(2):130-148.

SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing.

Bioinformatics. 2019 Oct 15;35(20):3944-3952. doi: 10.1093/bioinformatics/btz198.

GSMC: Combining Parallel Gibbs Sampling with Maximal Cliques for Hunting DNA Motif.

J Comput Biol. 2017 Dec;24(12):1243-1253. doi: 10.1089/cmb.2017.0100. Epub 2017 Nov 8.
