Moses A M, Chiang D Y, Eisen M B
Graduate Group in Biophysics, Center for Integrative Genomics, University of California, Berkeley, USA.
Pac Symp Biocomput. 2004:324-35. doi: 10.1142/9789812704856_0031.
The preferential conservation of transcription factor binding sites implies that non-coding sequence data from related species will prove a powerful asset for motif discovery. We present a unified probabilistic framework for motif discovery that incorporates evolutionary information. We treat aligned DNA sequence as a mixture of two evolutionary models, one for motif and one for background, and, following the example of the MEME program, provide an algorithm to estimate the parameters by Expectation-Maximization. We examine a variety of evolutionary models and show that our approach can take advantage of phylogenetic information to avoid false positives and discover motifs upstream of groups of characterized target genes. We compare our method to traditional motif finding restricted to conserved regions. An implementation will be made available at http://rana.lbl.gov.
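The mixture idea in the abstract can be illustrated with a minimal sketch of a single E-step: for each window of an ungapped two-species alignment, compute the posterior probability that the window was generated by the motif model rather than the background model. Everything below is an assumption made for illustration and is not taken from the paper: the function names `pair_prob` and `window_posteriors`, the simple substitution model (each descendant keeps the ancestral base with probability 1 - mu, otherwise redraws it from the equilibrium frequencies), the rates, the prior, and the toy sequences. The M-step, which would re-estimate the motif frequencies from these posteriors, is omitted.

```python
import math

BASES = "ACGT"

def pair_prob(a, b, freqs, mu):
    """P(a, b) for two aligned bases that share one ancestor.

    Assumed model (illustrative, not the paper's): the ancestor is drawn
    from `freqs`; each descendant keeps the ancestral base with probability
    1 - mu, or is redrawn from `freqs` with probability mu.
    """
    total = 0.0
    for x in BASES:  # sum over the unobserved ancestral base
        t_a = (1.0 - mu) * (a == x) + mu * freqs[a]
        t_b = (1.0 - mu) * (b == x) + mu * freqs[b]
        total += freqs[x] * t_a * t_b
    return total

def window_posteriors(seq1, seq2, motif, background, mu_motif, mu_bg, prior):
    """E-step: posterior that each aligned window is a motif occurrence.

    `motif` is a list of per-column frequency dicts, `background` is a
    single frequency dict, and `prior` is the mixing weight of the motif
    component.
    """
    width = len(motif)
    posteriors = []
    for start in range(len(seq1) - width + 1):
        log_motif = 0.0
        log_bg = 0.0
        for j in range(width):
            a, b = seq1[start + j], seq2[start + j]
            log_motif += math.log(pair_prob(a, b, motif[j], mu_motif))
            log_bg += math.log(pair_prob(a, b, background, mu_bg))
        # Posterior odds = likelihood ratio * prior odds.
        odds = math.exp(log_motif - log_bg) * prior / (1.0 - prior)
        posteriors.append(odds / (1.0 + odds))
    return posteriors

def consensus_column(base, weight=0.85):
    """Frequency dict favouring `base`; the remainder is split evenly."""
    rest = (1.0 - weight) / 3.0
    return {x: (weight if x == base else rest) for x in BASES}

# Toy data: "TGAC" occurs twice in seq1, but only the second occurrence
# is conserved in seq2.
seq1 = "TGACAATGAC"
seq2 = "TCGTAATGAC"
motif = [consensus_column(c) for c in "TGAC"]
background = {x: 0.25 for x in BASES}

# The motif is assumed to evolve more slowly than the background.
posts = window_posteriors(seq1, seq2, motif, background,
                          mu_motif=0.1, mu_bg=0.5, prior=0.05)
best = max(range(len(posts)), key=lambda i: posts[i])
print("most probable motif window starts at", best)
```

In this toy setting the conserved occurrence receives a much higher posterior than the occurrence that matches the motif in only one species, which is the sense in which evolutionary information can help avoid false positives.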