A study on the application of topic models to motif finding algorithms.

Authors

Basha Gutierrez Josep, Nakai Kenta

Affiliations

Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan.

Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639, Tokyo, Japan.

Publication information

BMC Bioinformatics. 2016 Dec 22;17(Suppl 19):502. doi: 10.1186/s12859-016-1364-3.

DOI:10.1186/s12859-016-1364-3

PMID:28155646

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5259985/

Abstract

BACKGROUND

Topic models are statistical algorithms that try to discover the structure of a set of documents from the abstract topics they contain. Here we apply this approach to discovering the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, a fundamental problem in molecular biology research for understanding transcriptional regulation. We present two methods that use topic models for motif finding. First, we developed an algorithm that treats a set of biological sequences as text documents and the k-mers they contain as words, then builds a correlated topic model (CTM) and iteratively reduces its perplexity. Second, we used the perplexity measurement of CTMs to improve our previous algorithm, which is based on a genetic algorithm and several statistical coefficients.
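The document/word analogy can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation: the function names (`kmer_counts`, `unigram_perplexity`), the choice of k = 3, and the toy sequences are all assumptions. It counts overlapping k-mers per sequence (the "bag of words" a topic model would consume) and computes perplexity under a simple unigram model, used here as a stand-in for a CTM, with the usual definition exp(-log-likelihood / number of tokens).

```python
import math
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in one sequence (one 'document')."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unigram_perplexity(docs):
    """Perplexity of a maximum-likelihood unigram model over all documents.

    This is a stand-in for a topic model, not a CTM:
    perplexity = exp(-log_likelihood / n_tokens).
    """
    total = Counter()
    for counts in docs:
        total.update(counts)
    n_tokens = sum(total.values())
    log_lik = sum(c * math.log(c / n_tokens) for c in total.values())
    return math.exp(-log_lik / n_tokens)


# Toy sequences (hypothetical, for illustration only).
sequences = ["ACGTACGT", "TTACGTTA"]
docs = [kmer_counts(s, k=3) for s in sequences]
```

Lower perplexity means the model assigns higher probability to the observed k-mers. A real CTM would replace `unigram_perplexity` with the held-out likelihood of the fitted topic model.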

RESULTS

The algorithms were tested with 56 data sets from four different species and compared with 14 other methods using several coefficients at both the nucleotide and site levels. Our first approach performed comparably to the other methods studied, especially at the site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. For our previous algorithm, the new approach with the added perplexity measurement clearly outperformed all of the other methods in sensitivity, at both the nucleotide and site levels, and in overall performance at the site level.
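The abstract does not spell out the coefficients used. As a hedged illustration, sensitivity is conventionally TP / (TP + FN), and at the nucleotide level it can be computed from the positions covered by known and predicted sites. The function name `nucleotide_sensitivity` and the half-open `(start, end)` interval convention below are assumptions made for this sketch.

```python
def nucleotide_sensitivity(known_sites, predicted_sites):
    """Fraction of known binding-site nucleotides covered by predictions.

    Sites are (start, end) half-open intervals on a single sequence.
    """
    known = {p for s, e in known_sites for p in range(s, e)}
    predicted = {p for s, e in predicted_sites for p in range(s, e)}
    if not known:
        raise ValueError("no known sites to evaluate against")
    true_pos = len(known & predicted)   # known positions that were predicted
    false_neg = len(known - predicted)  # known positions that were missed
    return true_pos / (true_pos + false_neg)


# Hypothetical example: one known site at [10, 20), one prediction at [15, 25).
print(nucleotide_sensitivity([(10, 20)], [(15, 25)]))  # prints 0.5
```

Site-level scores count whole sites rather than positions, so they also need a rule for when a prediction "hits" a known site. That rule is not given here and is left out of the sketch.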

CONCLUSIONS

The statistics obtained show that the performance of a motif finding method based on a CTM is satisfactory enough to conclude that topic models are a valid basis for developing motif finding algorithms. Moreover, adding topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool for predicting motifs in different kinds of DNA sequence sets.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a581/5259985/46a4e599c28c/12859_2016_1364_Fig1_HTML.jpg
