Vorontsov Ilya E, Kulakovskiy Ivan V, Makeev Vsevolod J
Department of Computational Systems Biology, Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina str. 3, Moscow 119991, GSP-1, Russia.
Data Analysis Department, Yandex Data Analysis School, Moscow Institute of Physics and Technology, Leo Tolstoy str. 16, Moscow 119021, Russia.
Algorithms Mol Biol. 2013 Sep 30;8(1):23. doi: 10.1186/1748-7188-8-23.
Positional weight matrix (PWM) remains the most popular model for quantifying transcription factor (TF) binding. A PWM supplied with a score threshold defines a set of putative transcription factor binding sites (TFBS), thus providing a TFBS model. TF-binding DNA fragments obtained by different experimental methods usually give similar but not identical PWMs. The same holds for different TFs from the same structural family. Thus it is often necessary to measure the similarity between PWMs. Popular tools compare PWMs directly using matrix elements. Yet, for log-odds PWMs, negative elements do not contribute to the scores of high-scoring TFBS and thus may differ without affecting the sets of the best-recognized binding sites. Moreover, the two TFBS sets recognized by a given pair of PWMs can be more or less different depending on the score thresholds.
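To make the notion of a TFBS model concrete, the following is a minimal sketch of how a PWM plus a score threshold picks out a set of words. The matrix values, the names `score` and `site_set`, and the brute-force enumeration are illustrative assumptions, not taken from MACRO-APE; enumeration is only feasible for very short toy matrices.

```python
from itertools import product

ALPHABET = "ACGT"

def score(pwm, word):
    """Sum the matrix entries selected by each letter of the word."""
    return sum(column[letter] for column, letter in zip(pwm, word))

def site_set(pwm, threshold):
    """Return every word of the PWM's length scoring at or above the threshold."""
    words = ("".join(w) for w in product(ALPHABET, repeat=len(pwm)))
    return {w for w in words if score(pwm, w) >= threshold}

# Hypothetical 3-column log-odds-style PWM (made-up values, one dict per position).
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -2.0},
    {"A": -2.0, "C": 1.0, "G": 0.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -3.0, "T": 1.0},
]

strict = site_set(toy_pwm, 2.5)   # only the best-scoring words
relaxed = site_set(toy_pwm, 1.0)  # a lower threshold admits more words
print(sorted(strict))
print(len(relaxed))
```

The same matrix yields a different site set at each threshold, which is why the threshold has to be treated as part of the model when two models are compared.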
We propose a practical approach for comparing two TFBS models, each consisting of a PWM and the respective scoring threshold. The proposed measure is a variant of the Jaccard index between two TFBS sets. The measure defines a metric space for TFBS models of all finite lengths. The algorithm can compare TFBS models constructed using substantially different approaches, such as PWMs based on raw positional counts and on log-odds. We present an efficient software implementation: MACRO-APE (MAtrix CompaRisOn by Approximate P-value Estimation).
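As an illustration of the idea, the plain Jaccard index between two site sets can be sketched as below. This is a simplified stand-in rather than the measure implemented in MACRO-APE: it assumes equal-length models, weights all words equally, and enumerates every word exhaustively instead of estimating P-values approximately. The names `jaccard` and `model_similarity` and both toy matrices are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"

def site_set(pwm, threshold):
    """Enumerate all words of the PWM's length that score at or above the threshold."""
    words = ("".join(w) for w in product(ALPHABET, repeat=len(pwm)))
    return {w for w in words
            if sum(col[ch] for col, ch in zip(pwm, w)) >= threshold}

def jaccard(set_a, set_b):
    """|A & B| / |A | B|; returns 0.0 for two empty sets (a convention assumed here)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def model_similarity(pwm_a, thr_a, pwm_b, thr_b):
    """Jaccard index between the site sets of two (PWM, threshold) models."""
    return jaccard(site_set(pwm_a, thr_a), site_set(pwm_b, thr_b))

# Two made-up PWMs with identical positive entries but different negative ones.
pwm_a = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -2.0},
    {"A": -2.0, "C": 1.0, "G": 0.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -3.0, "T": 1.0},
]
pwm_b = [
    {"A": 1.0, "C": -3.0, "G": -2.0, "T": -5.0},
    {"A": -4.0, "C": 1.0, "G": 0.5, "T": -3.0},
    {"A": -2.0, "C": -2.0, "G": -6.0, "T": 1.0},
]

high = model_similarity(pwm_a, 2.5, pwm_b, 2.5)  # strict thresholds
low = model_similarity(pwm_a, 1.0, pwm_b, 1.0)   # relaxed thresholds
print(high, low)
```

In this toy case the two matrices differ element by element, yet at the strict threshold they recognize the same words, so the set-based similarity is maximal; at the relaxed threshold the negative entries start to matter and the similarity drops. For the plain Jaccard index, `1 - jaccard(A, B)` is the standard Jaccard distance.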
MACRO-APE can be effectively used to compute the Jaccard-index-based similarity of two TFBS models. A two-pass scanning algorithm is presented for searching a given collection of PWMs for those similar to a given query.
MACRO-APE is implemented in Ruby 1.9; the software, including source code and a manual, is freely available at http://autosome.ru/macroape/ and in the supplementary materials.