Pape Utz J, Rahmann Sven, Vingron Martin
Computational Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany.
Bioinformatics. 2008 Feb 1;24(3):350-7. doi: 10.1093/bioinformatics/btm610. Epub 2008 Jan 2.
Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires the comparison to known PFMs to avoid redundancies. In general, two PFMs are similar if they occur at overlapping positions under a null model. Still, most existing methods compute similarity according to probabilistic distances of the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the number of PFM hits incorporating both strands. Furthermore, we introduce a second measure based on the same idea to cluster a set of the Jaspar PFMs.
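As an illustration of what a "PFM hit incorporating both strands" can mean in practice, the following minimal sketch converts a PFM into log-odds scores and counts windows on both strands whose score reaches a threshold. The function names, the pseudocount, the uniform background and the threshold are assumptions made for this example; they are not taken from the paper or from the software at the URL given in this record.

```python
import math

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def log_odds(pfm, background=None, pseudocount=1.0):
    """Turn a PFM (list of per-position count dicts) into log2-odds scores.

    The pseudocount and the uniform default background are illustrative choices.
    """
    background = background or {b: 0.25 for b in ALPHABET}
    pwm = []
    for column in pfm:
        total = sum(column[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in ALPHABET
        })
    return pwm


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def count_hits(pwm, seq, threshold):
    """Count windows on either strand whose summed score is >= threshold."""
    width = len(pwm)
    hits = 0
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - width + 1):
            score = sum(pwm[i][strand[start + i]] for i in range(width))
            if score >= threshold:
                hits += 1
    return hits


def consensus_pfm(word, count=10):
    """Hypothetical helper: a toy PFM with all counts on one consensus word."""
    return [{b: (count if b == base else 0) for b in ALPHABET} for base in word]
```

With a toy matrix built by `consensus_pfm("GATA")`, a sequence containing `GATA` on the forward strand and a sequence containing it only on the reverse strand each yield one hit, which is the sense in which both strands contribute to the hit count.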
We show that the asymptotic covariance can be efficiently computed by a two-dimensional convolution of the score distributions. The asymptotic covariance approach shows strong correlation with simulated data and outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class. Furthermore, a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons.
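To make the notion of a score distribution concrete, here is a minimal sketch of the one-dimensional case: the exact distribution of a single matrix's score on a random word, obtained by convolving the per-position score distributions under an i.i.d. background. This is a simplified illustration only. It is not the two-dimensional convolution over pairs of matrices described above, and the integer-valued scores, function names and background are assumptions of this example.

```python
from collections import defaultdict


def score_distribution(int_pwm, background):
    """Exact score distribution of a random word under an i.i.d. background.

    int_pwm: list of dicts mapping each base to an integer score (assumed
             to be pre-rounded so that equal scores collapse into one bin).
    background: dict mapping each base to its probability.
    """
    dist = {0: 1.0}  # before any position is read, the score is 0
    for column in int_pwm:
        nxt = defaultdict(float)
        # Convolve the running distribution with this column's distribution.
        for score, prob in dist.items():
            for base, s in column.items():
                nxt[score + s] += prob * background[base]
        dist = dict(nxt)
    return dist


def hit_probability(dist, threshold):
    """Probability that a random word scores at or above the threshold."""
    return sum(p for s, p in dist.items() if s >= threshold)
```

For a toy two-column matrix that scores 2 for the preferred base and -1 otherwise, under a uniform background, the only word reaching a threshold of 4 is the consensus itself, so `hit_probability` returns 1/16.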
A website for computing the similarity and performing clustering, the source code, and Supplementary Material are available at http://mosta.molgen.mpg.de.