Santolini Marc, Mora Thierry, Hakim Vincent
Laboratoire de Physique Statistique, CNRS, Université P. et M. Curie, Université D. Diderot, École Normale Supérieure, Paris, France.
PLoS One. 2014 Jun 13;9(6):e99015. doi: 10.1371/journal.pone.0099015. eCollection 2014.
The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to the transcription factor (TF) binding. However, this description ignores correlations between nucleotides at different positions, and is generally inaccurate: analysing fly and mouse in vivo ChIPseq data, we show that in most cases the PWM model fails to reproduce the observed statistics of TFBSs. To overcome this issue, we introduce the pairwise interaction model (PIM), a generalization of the PWM model. The model is based on the principle of maximum entropy and explicitly describes pairwise correlations between nucleotides at different positions, while being otherwise as unconstrained as possible. It is mathematically equivalent to considering a TF-DNA binding energy that depends additively on each nucleotide identity at all positions in the TFBS, like the PWM model, but also additively on pairs of nucleotides. We find that the PIM significantly improves over the PWM model, and even provides an optimal description of TFBS statistics within statistical noise. The PIM generalizes previous approaches to interdependent positions: it accounts for co-variation of two or more base pairs, and predicts secondary motifs, while outperforming multiple-motif models consisting of mixtures of PWMs. We analyse the structure of pairwise interactions between nucleotides, and find that they are sparse and dominantly located between consecutive base pairs in the flanking region of TFBS. Nonetheless, interactions between pairs of non-consecutive nucleotides are found to play a significant role in the obtained accurate description of TFBS statistics. 
The PIM is computationally tractable, and provides a general framework that should be useful for describing and predicting TFBSs beyond PWMs.
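To make the relation between the two models concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a Boltzmann-like form P(s) ∝ exp(-E(s)), with E(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j). The names `h`, `J`, `pim_energy` and `pim_probabilities`, the sign convention, and the random parameter values are illustrative assumptions rather than details given in the abstract. Normalization is done by brute-force enumeration, which is only feasible for very short sites.

```python
import itertools
import numpy as np

BASES = "ACGT"

def encode(seq):
    """Map a DNA string to integer indices (A=0, C=1, G=2, T=3)."""
    return np.array([BASES.index(b) for b in seq])

def pim_energy(seq, h, J):
    """Energy with additive single-site terms h[i, a] and pairwise terms
    J[i, j, a, b] for i < j (hypothetical parameterization).
    Setting J to zero gives the independent-site (PWM-like) energy."""
    s = encode(seq)
    n = len(s)
    energy = -sum(h[i, s[i]] for i in range(n))
    energy -= sum(J[i, j, s[i], s[j]]
                  for i in range(n) for j in range(i + 1, n))
    return energy

def pim_probabilities(h, J):
    """Normalize exp(-E) over all 4**n sequences by explicit enumeration."""
    n = h.shape[0]
    seqs = ["".join(p) for p in itertools.product(BASES, repeat=n)]
    weights = np.exp(-np.array([pim_energy(s, h, J) for s in seqs]))
    return dict(zip(seqs, weights / weights.sum()))

# Toy parameters for a site of length 3 (arbitrary, not fitted to data).
rng = np.random.default_rng(0)
site_length = 3
h = rng.normal(size=(site_length, 4))

# Without couplings, the model factorizes into per-position frequencies.
J_zero = np.zeros((site_length, site_length, 4, 4))
probs_independent = pim_probabilities(h, J_zero)
pwm = np.exp(h) / np.exp(h).sum(axis=1, keepdims=True)

# One favorable coupling between 'A' at position 0 and 'A' at position 1.
J_coupled = J_zero.copy()
J_coupled[0, 1, 0, 0] = 1.0
probs_coupled = pim_probabilities(h, J_coupled)

print("P(AAC), independent:", probs_independent["AAC"])
print("P(AAC), coupled:    ", probs_coupled["AAC"])
```

With all pairwise terms set to zero, each sequence probability equals the product of the per-position frequencies in `pwm`, which is the independence assumption of the PWM. A single nonzero coupling raises the probability of sequences containing the favored pair in a way no PWM can represent.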