Suppr超能文献

STEME:高效的 EM 算法,用于在大数据集中发现模式。

STEME: efficient EM to find motifs in large data sets.

机构信息

MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Robinson Way, Cambridge CB2 0SR, UK.

出版信息

Nucleic Acids Res. 2011 Oct;39(18):e126. doi: 10.1093/nar/gkr574. Epub 2011 Jul 23.

Abstract

MEME and many other popular motif finders use the expectation-maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.

摘要

MEME 和许多其他流行的模体发现工具使用期望最大化(EM)算法来优化其参数。不幸的是,EM 的运行时间与输入序列的长度呈线性关系。这可能会禁止其应用于高通量生物技术通常生成的数据集的大小。后缀树是一种可以有效地索引一组序列的数据结构。我们描述了一种算法,即模体提取的后缀树 EM(STEME),它使用后缀树来近似 EM。据我们所知,这是后缀树首次应用于 EM。我们对算法的预期运行时间进行了分析,并证明 STEME 的运行速度比 MEME 使用的 EM 实现快一个数量级。我们给出了逼近质量的理论界限,并表明,在实践中,逼近对结果的影响可以忽略不计。我们提供了算法的开源实现,我们希望它将用于加速现有的和未来的模体搜索算法。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/013f/3185442/bc31a084a79d/gkr574f1.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验