Sir Harold Mitchell Building, School of Biology, University of St Andrews, St Andrews, Fife, KY16 9TH, UK.
Nucleic Acids Res. 2013 Jun;41(11):5582-93. doi: 10.1093/nar/gkt260. Epub 2013 Apr 17.
Genome-wide prediction of transcription factor binding sites is notoriously difficult. We have developed and applied a logistic regression approach for prediction of binding sites for the p53 transcription factor that incorporates sequence information and chromatin modification data. We tested the model by comparing predicted sites with known binding sites defined by chromatin immunoprecipitation (ChIP), by the location of predictions relative to genes, by the function of nearby genes and by analysis of gene expression data after p53 activation. We compared the predictions made by our novel model with predictions based only on matches to a sequence position weight matrix (PWM). In whole-genome assays, the fraction of known sites identified by the two models was similar, suggesting that there was little to be gained from including chromatin modification data. In contrast, there were highly significant and biologically relevant differences between the two models in the location of the predicted binding sites relative to genes, in the function of nearby genes and in the responsiveness of nearby genes to p53 activation. We propose that these contradictory results can be explained by PWM and ChIP data reflecting primarily biophysical properties of protein-DNA interactions, whereas chromatin modification data capture biologically important functional information.
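The abstract describes combining a PWM match with chromatin modification data in a logistic regression. The following is a minimal sketch of that general idea, not the authors' implementation: the 4-bp matrix `TOY_PWM`, the window sequences, the single binary chromatin feature, and the labels in `TOY_WINDOWS` are all invented for illustration, and the fitting routine is plain batch gradient descent chosen only to keep the example dependency-free. The abstract does not specify the actual PWM, features, or fitting procedure.

```python
import math

# Hypothetical 4-position PWM given as per-position base probabilities.
# Toy values only; this is NOT the p53 matrix.
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def pwm_score(site, pwm=TOY_PWM):
    """Log-odds score (base 2) of a site whose length equals the PWM width."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, site))


def best_pwm_score(seq, pwm=TOY_PWM):
    """Highest log-odds score over all windows of PWM width in seq."""
    width = len(pwm)
    return max(pwm_score(seq[i:i + width], pwm)
               for i in range(len(seq) - width + 1))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias):
    """Predicted binding probability for one feature vector."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, weights, bias) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Invented training windows: (sequence, chromatin mark present?, bound?).
MOTIF_SEQ = "GGACGTGG"     # contains the toy consensus ACGT
NO_MOTIF_SEQ = "GGGGGGGG"  # no good match to the toy PWM
TOY_WINDOWS = (
    [(MOTIF_SEQ, 1.0, 1)] * 3
    + [(MOTIF_SEQ, 0.0, 1), (MOTIF_SEQ, 0.0, 0)]
    + [(NO_MOTIF_SEQ, 1.0, 1), (NO_MOTIF_SEQ, 1.0, 0)]
    + [(NO_MOTIF_SEQ, 0.0, 0)] * 3
)


def build_features(windows):
    """Turn windows into [best PWM score, chromatin signal] rows and labels."""
    X = [[best_pwm_score(seq), chrom] for seq, chrom, _ in windows]
    y = [label for _, _, label in windows]
    return X, y


if __name__ == "__main__":
    X, y = build_features(TOY_WINDOWS)
    weights, bias = fit_logistic(X, y)
    for seq, chrom in [(MOTIF_SEQ, 1.0), (MOTIF_SEQ, 0.0), (NO_MOTIF_SEQ, 0.0)]:
        p = predict([best_pwm_score(seq), chrom], weights, bias)
        print(f"{seq} chromatin={chrom:.0f} -> P(bound)={p:.3f}")
```

A PWM-only predictor corresponds to thresholding `best_pwm_score` alone; the combined model lets the chromatin feature shift the predicted probability for windows with the same sequence score. On real data the features would need scaling and the model would need evaluation on held-out sites.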