Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm.

Author information

Zhang ZhiZhuo, Chang Cheng Wei, Hugo Willy, Cheung Edwin, Sung Wing-Kin

Affiliations

National University of Singapore, Singapore, Singapore.

Publication information

J Comput Biol. 2013 Mar;20(3):237-48. doi: 10.1089/cmb.2012.0233.

Abstract

Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider some additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses a pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs that better match the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveal potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the coTFs. The application is available online.
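
To make the idea of learning a motif and its position preference together more concrete, the following is a minimal illustrative sketch and not the SEME implementation. It assumes a simple two-component mixture over all sequence windows: a window either comes from the motif (with a position weight matrix and a learned distribution over position bins) or from a fixed zero-order background. The function name `em_motif_with_position`, the binning scheme, the pseudocounts, and the most-frequent-k-mer initialization are all assumptions made for this example. Sequence rank preference, variable motif length extension, and importance sampling are omitted; a rank preference could in principle be handled as one more binned feature in the same way.

```python
from collections import Counter

ALPHABET = "ACGT"

def em_motif_with_position(seqs, width, n_bins=5, n_iter=50):
    """Toy EM: jointly fit a PWM and a position-bin preference.

    Assumes uppercase sequences over A, C, G, T, each at least `width` long.
    """
    # Enumerate every window together with the bin of its start position.
    windows = []
    for s in seqs:
        n_starts = len(s) - width + 1
        for i in range(n_starts):
            windows.append((s[i:i + width], i * n_bins // n_starts))

    # Background: fixed base frequencies and empirical bin frequencies.
    base = Counter({a: 1.0 for a in ALPHABET})
    for s in seqs:
        base.update(s)
    total = sum(base.values())
    bg = {a: base[a] / total for a in ALPHABET}
    bin_counts = Counter(b for _, b in windows)
    bg_pos = [bin_counts[b] / len(windows) for b in range(n_bins)]

    # Initialise the PWM around the most frequent k-mer (an assumption).
    start = Counter(w for w, _ in windows).most_common(1)[0][0]
    pwm = [{a: 0.7 if a == start[j] else 0.1 for a in ALPHABET}
           for j in range(width)]
    pos = [1.0 / n_bins] * n_bins               # motif position preference
    lam = min(0.5, len(seqs) / len(windows))    # prior P(window is a site)

    for _ in range(n_iter):
        # E-step: posterior probability that each window is a motif site.
        resp = []
        for word, b in windows:
            p_motif = lam * pos[b]
            p_back = (1.0 - lam) * bg_pos[b]
            for j, ch in enumerate(word):
                p_motif *= pwm[j][ch]
                p_back *= bg[ch]
            resp.append(p_motif / (p_motif + p_back))

        # M-step: re-estimate PWM, position preference and mixing weight
        # from the responsibilities (small pseudocounts avoid zeros).
        new_pwm = [{a: 0.01 for a in ALPHABET} for _ in range(width)]
        new_pos = [0.01] * n_bins
        for (word, b), r in zip(windows, resp):
            new_pos[b] += r
            for j, ch in enumerate(word):
                new_pwm[j][ch] += r
        pwm = [{a: col[a] / sum(col.values()) for a in ALPHABET}
               for col in new_pwm]
        z = sum(new_pos)
        pos = [v / z for v in new_pos]
        lam = min(max(sum(resp) / len(windows), 1e-6), 1.0 - 1e-6)

    consensus = "".join(max(col, key=col.get) for col in pwm)
    return consensus, pwm, pos, lam
```

Calling `em_motif_with_position(seqs, width=7, n_bins=3)` on a list of DNA strings returns a consensus string, the fitted PWM, the learned distribution over position bins, and the mixing weight. In this sketch the position preference is learned from the data rather than supplied by the user, which is the aspect of the abstract it is meant to illustrate.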
