GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery.

Author information

Li Leping

Affiliation

Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA.

Publication information

J Comput Biol. 2009 Feb;16(2):317-29. doi: 10.1089/cmb.2008.16TT.

DOI:10.1089/cmb.2008.16TT

PMID:19193149

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2756050/

Abstract

Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally feasible using current methods that employ probabilistic models. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization (EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. GADEM performed comparably with MEME on 500 sets of simulated "ChIP" sequences with embedded known P53 binding sites. The major advantage of GADEM is its computational efficiency on large ChIP datasets compared to competitors. We applied GADEM to six genome-wide ChIP datasets. Approximately 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif (e.g., P53 in P53 data) was identified every time. GADEM discovered motifs of various lengths (6-40 bp) and characteristics in these datasets, which contained from 0.5 to >13 million nucleotides, with run times of 5 to 96 h. GADEM can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for de novo motif discovery in large-scale genome-wide data. The GADEM software is available at www.niehs.nih.gov/research/resources/software/GADEM/.
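The first steps described in the abstract (ranking candidate words, turning a spaced dyad into a starting PWM, and scoring it by entropy) can be illustrated with a minimal Python sketch. This is not the GADEM implementation: the function names, the observed/expected ranking score, the `strong` probability of 0.7, and the uniform spacer columns are all assumptions made for illustration, and the GA, the EM refinement, and the significance test are omitted.

```python
import math
from collections import Counter

BASES = "ACGT"

def rank_words(seqs, k, background):
    """Rank k-mers by observed/expected count under an i.i.d. background.

    Assumption: the ratio score stands in for the paper's
    'degree of overrepresentation'; input contains only A, C, G, T.
    """
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())

    def ratio(word):
        expected = total * math.prod(background[b] for b in word)
        return counts[word] / expected

    return sorted(counts, key=ratio, reverse=True)

def dyad_to_pwm(word1, spacer, word2, strong=0.7):
    """Convert a spaced dyad (word1, spacer length, word2) into a starting PWM.

    Assumption: word positions give probability `strong` to the stated base
    and split the rest evenly; spacer positions are uniform.
    """
    weak = (1.0 - strong) / 3.0
    pwm = [{b: (strong if b == base else weak) for b in BASES} for base in word1]
    pwm.extend({b: 0.25 for b in BASES} for _ in range(spacer))
    pwm.extend({b: (strong if b == base else weak) for b in BASES} for base in word2)
    return pwm

def relative_entropy(pwm, background):
    """Sum over columns of sum_b p(b) * log2(p(b) / q(b)), in bits."""
    return sum(
        p * math.log2(p / background[b])
        for column in pwm
        for b, p in column.items()
        if p > 0
    )

if __name__ == "__main__":
    uniform = {b: 0.25 for b in BASES}
    words = rank_words(["ACGTACGT", "ACGTTTTT"], 4, uniform)
    pwm = dyad_to_pwm(words[0], 3, words[1])
    print(words[0], len(pwm), round(relative_entropy(pwm, uniform), 3))
```

In this sketch a dyad is just a triple (word, spacer length, word), which is the kind of compact encoding a genetic algorithm could mutate and recombine; an EM step would then refine each column of the resulting matrix against the sequence data.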
