Li Leping
Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA.
J Comput Biol. 2009 Feb;16(2):317-29. doi: 10.1089/cmb.2008.16TT.
Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is generally not feasible with current methods that employ probabilistic models. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization (EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. GADEM performed comparably with MEME on 500 sets of simulated "ChIP" sequences with embedded known P53 binding sites. The major advantage of GADEM over competing methods is its computational efficiency on large ChIP datasets. We applied GADEM to six genome-wide ChIP datasets. Approximately 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif (e.g., P53 in P53 data) was identified every time. GADEM discovered motifs of various lengths (6-40 bp) and characteristics in these datasets, which contained from 0.5 to >13 million nucleotides, with run times of 5 to 96 h. GADEM can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for de novo motif discovery in large-scale genome-wide data. The GADEM software is available at www.niehs.nih.gov/research/resources/software/GADEM/.
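The step in which a spaced dyad is converted into a starting PWM can be illustrated with a minimal sketch. This is not the GADEM implementation: the function name `dyad_to_pwm`, the weight of 0.7 given to the word's own nucleotide, and the uniform spacer columns are all assumptions chosen for illustration, since the abstract does not specify how the starting matrix is filled in.

```python
from typing import List

ALPHABET = "ACGT"  # column order of every PWM row


def dyad_to_pwm(word1: str, spacer: int, word2: str,
                dominant: float = 0.7) -> List[List[float]]:
    """Turn a spaced dyad (word1, gap of `spacer` bases, word2) into a starting PWM.

    Hypothetical scheme, assumed for illustration only:
    - each word position gives `dominant` probability to the word's own base
      and splits the remainder evenly over the other three bases;
    - each spacer position is uniform (0.25 per base).
    """
    if not 0.25 < dominant < 1.0:
        raise ValueError("dominant must lie strictly between 0.25 and 1.0")
    if spacer < 0:
        raise ValueError("spacer must be non-negative")
    for word in (word1, word2):
        if not word or any(base not in ALPHABET for base in word):
            raise ValueError(f"word must be a non-empty string over {ALPHABET}")

    rest = (1.0 - dominant) / 3.0
    pwm: List[List[float]] = []
    for base in word1:
        pwm.append([dominant if b == base else rest for b in ALPHABET])
    for _ in range(spacer):
        pwm.append([0.25] * len(ALPHABET))
    for base in word2:
        pwm.append([dominant if b == base else rest for b in ALPHABET])
    return pwm


if __name__ == "__main__":
    # A dyad of two 3-mers separated by a 2-base gap gives an 8-column PWM.
    for row in dyad_to_pwm("ACG", 2, "TTA"):
        print(" ".join(f"{p:.3f}" for p in row))
```

A matrix of this kind would then serve as the starting point that the embedded EM algorithm refines; the uninformative spacer columns leave EM free to learn whatever signal, if any, lies between the two words.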