Habib Naomi, Kaplan Tommy, Margalit Hanah, Friedman Nir
School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel.
PLoS Comput Biol. 2008 Feb 29;4(2):e1000010. doi: 10.1371/journal.pcbi.1000010.
Characterizing the DNA-binding specificities of transcription factors is a key problem in computational biology that has been addressed by multiple algorithms. These usually take as input sequences that are putatively bound by the same factor and output one or more DNA motifs. A common practice is to apply several such algorithms simultaneously to improve coverage at the price of redundancy. In interpreting such results, two tasks are crucial: clustering redundant motifs, and attributing the motifs to transcription factors by retrieving similar motifs from previously characterized motif libraries. Both tasks inherently involve motif comparison. Here we present a novel method for comparing and merging motifs, based on Bayesian probabilistic principles. This method takes into account both the similarity in positional nucleotide distributions of the two motifs and their dissimilarity to the background distribution. We demonstrate the use of the new comparison method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. Our results show that the new method outperforms other available methods in accuracy and sensitivity. We incorporated the resulting motif clustering and retrieval procedures in a large-scale automated pipeline for analyzing DNA motifs. This pipeline integrates the results of various DNA motif discovery algorithms and automatically merges redundant motifs from multiple training sets into a coherent annotated library of motifs. Application of this pipeline to recent genome-wide transcription factor location data in S. cerevisiae identified DNA motifs with results comparable to the semi-automated analysis reported in the literature. Moreover, we show how this analysis elucidates the mechanisms of condition-specific preferences of transcription factors.
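The comparison idea in the abstract (reward two motifs for sharing positional nucleotide distributions, and also for differing from the background) can be illustrated with a small sketch. It is not the authors' implementation: the function names, the symmetric Dirichlet prior `alpha`, the uniform default background, and the assumption that the two motifs are already aligned and of equal length are all choices made here for illustration. Each motif is a list of columns, and each column holds nucleotide counts in the order A, C, G, T.

```python
from math import lgamma, log


def log_marginal(counts, alpha):
    """Log-likelihood of the observed nucleotides in one column, with the
    unknown nucleotide distribution integrated out under a Dirichlet prior."""
    total_alpha = sum(alpha)
    total_counts = sum(counts)
    result = lgamma(total_alpha) - lgamma(total_alpha + total_counts)
    for n, a in zip(counts, alpha):
        result += lgamma(a + n) - lgamma(a)
    return result


def column_score(n1, n2, background, alpha=(1.0, 1.0, 1.0, 1.0)):
    """Score one pair of aligned columns (illustrative, hypothetical scoring).

    Two log-ratios are added:
      common source vs. two independent sources  (similarity to each other)
      common source vs. fixed background         (dissimilarity to background)
    """
    pooled = [a + b for a, b in zip(n1, n2)]
    log_common = log_marginal(pooled, alpha)
    log_independent = log_marginal(n1, alpha) + log_marginal(n2, alpha)
    log_background = sum(n * log(p) for n, p in zip(pooled, background))
    return (log_common - log_independent) + (log_common - log_background)


def motif_score(motif1, motif2, background=(0.25, 0.25, 0.25, 0.25)):
    """Sum the column scores of two pre-aligned, equal-length count matrices."""
    if len(motif1) != len(motif2):
        raise ValueError("motifs must be aligned to the same length")
    return sum(column_score(c1, c2, background)
               for c1, c2 in zip(motif1, motif2))


# Hypothetical count matrices, two columns each (order: A, C, G, T).
sharp = [[10, 0, 0, 0], [0, 0, 10, 0]]
other = [[0, 10, 0, 0], [0, 0, 0, 10]]
flat = [[3, 3, 2, 2], [2, 3, 3, 2]]

same_score = motif_score(sharp, sharp)   # similar and informative
diff_score = motif_score(sharp, other)   # informative but dissimilar
flat_score = motif_score(flat, flat)     # similar but background-like
```

In this toy setup, two identical, informative motifs score higher than either a dissimilar pair or a pair that resembles the background, which is the behaviour the abstract describes. A practical tool would also need to search over alignments and strand orientations, which this sketch leaves out.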