Buhler Jeremy, Tompa Martin
Department of Computer Science, Box 1045, Washington University, One Brookings Drive, St. Louis, MO 63130, USA.
J Comput Biol. 2002;9(2):225-42. doi: 10.1089/10665270252935430.
The DNA motif discovery problem abstracts the task of discovering short, conserved sites in genomic DNA. Pevzner and Sze recently described a precise combinatorial formulation of motif discovery that motivates the following algorithmic challenge: find twenty planted occurrences of a motif of length fifteen in roughly twelve kilobases of genomic sequence, where each occurrence of the motif differs from its consensus in four randomly chosen positions. Such "subtle" motifs, though statistically highly significant, expose a weakness in existing motif-finding algorithms, which typically fail to discover them. Pevzner and Sze introduced new algorithms to solve their (15,4)-motif challenge, but these methods do not scale efficiently to more difficult problems in the same family, such as the (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif-discovery algorithm, PROJECTION, designed to enhance the performance of existing motif finders using random projections of the input's substrings. Experiments on synthetic data demonstrate that PROJECTION remedies the weakness observed in existing algorithms, typically solving the difficult (14,4)-, (16,5)-, and (18,6)-motif problems. Our algorithm is robust to nonuniform background sequence distributions and scales to larger amounts of sequence than that specified in the original challenge. A probabilistic estimate suggests that related motif-finding problems that PROJECTION fails to solve are in all likelihood inherently intractable. We also test the performance of our algorithm on realistic biological examples, including transcription factor binding sites in eukaryotes and ribosome binding sites in prokaryotes.
DNA基序发现问题概括了在基因组DNA中发现短的保守位点的任务。佩夫兹纳和斯泽最近描述了一种精确的基序发现组合公式,这引发了以下算法挑战:在大约12千碱基的基因组序列中找到20个长度为15的基序植入实例,其中基序的每个实例在四个随机选择的位置与其共有序列不同。这种“微妙”的基序虽然在统计学上具有高度显著性,但暴露了现有基序查找算法的一个弱点,即这些算法通常无法发现它们。佩夫兹纳和斯泽引入了新算法来解决他们的(15,4)-基序挑战,但这些方法不能有效地扩展到同一类中更困难的问题,如(14,4)-、(16,5)-和(18,6)-基序问题。我们引入了一种新颖的基序发现算法PROJECTION,旨在通过对输入子串的随机投影来提高现有基序查找器的性能。对合成数据的实验表明,PROJECTION弥补了现有算法中观察到的弱点,通常能解决困难的(14,4)-、(16,5)-和(18,6)-基序问题。我们的算法对非均匀背景序列分布具有鲁棒性,并且能够扩展到比原始挑战中指定的序列量更大的情况。概率估计表明,PROJECTION未能解决的相关基序查找问题很可能本质上是难以处理的。我们还在实际生物学实例上测试了我们算法的性能,包括真核生物中的转录因子结合位点和原核生物中的核糖体结合位点。