Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, Knowledge Systems and Informatics, Lawrence Livermore National Laboratory, Livermore, CA 94550 and Department of Computer Science, Duke University, Durham, NC 27708, USA.
Bioinformatics. 2014 Oct 15;30(20):2868-74. doi: 10.1093/bioinformatics/btu408. Epub 2014 Jun 27.
Transcriptional regulation is directly enacted by the interactions between DNA and many proteins, including transcription factors (TFs), nucleosomes and polymerases. A critical step in deciphering transcriptional regulation is to infer, and eventually predict, the precise locations of these interactions, along with their strength and frequency. While recent datasets yield great insight into these interactions, individual data sources often provide only partial information regarding one aspect of the complete interaction landscape. For example, chromatin immunoprecipitation (ChIP) reveals the binding positions of a protein, but only for one protein at a time. In contrast, nucleases like MNase and DNase can be used to reveal binding positions for many different proteins at once, but cannot easily determine the identities of those proteins. Currently, few statistical frameworks jointly model these different data sources to reveal an accurate, holistic view of the in vivo protein-DNA interaction landscape.
Here, we develop a novel statistical framework that integrates different sources of experimental information within a thermodynamic model of competitive binding to jointly learn a holistic view of the in vivo protein-DNA interaction landscape. We show that our framework learns an interaction landscape with increased accuracy, explaining multiple sets of data in accordance with thermodynamic principles of competitive DNA binding. The resulting model of genomic occupancy provides a precise mechanistic vantage point from which to explore the role of protein-DNA interactions in transcriptional regulation.
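To make the idea of thermodynamic competitive binding concrete, the following is a minimal single-site sketch in Python. It is not the authors' compete implementation, which the abstract describes only at a high level; the function name `site_occupancy`, the example factor names, the energies and the value of `kT` are all illustrative assumptions. Each factor receives a Boltzmann weight proportional to its concentration and binding energy, and occupancy probabilities are obtained by normalizing over all mutually exclusive states, including the unbound state.

```python
import math


def site_occupancy(concentrations, binding_energies, kT=0.593):
    """Toy model: several factors compete for a single DNA site.

    concentrations   -- dict of factor name -> relative concentration (assumed units)
    binding_energies -- dict of factor name -> binding free energy in kcal/mol
                        (more negative means tighter binding)
    kT               -- thermal energy; 0.593 kcal/mol corresponds to about 298 K
                        (an assumed default, not taken from the paper)
    Returns a dict of state -> probability, including the "unbound" state.
    """
    # Boltzmann weight of each bound state: concentration * exp(-E / kT)
    weights = {
        name: concentrations[name] * math.exp(-binding_energies[name] / kT)
        for name in concentrations
    }
    # Partition function: the unbound state has weight 1 by convention
    z = 1.0 + sum(weights.values())
    occupancy = {name: w / z for name, w in weights.items()}
    occupancy["unbound"] = 1.0 / z
    return occupancy


if __name__ == "__main__":
    # Hypothetical example: a TF and a nucleosome at equal concentration
    occ = site_occupancy(
        {"TF": 1.0, "nucleosome": 1.0},
        {"TF": -2.0, "nucleosome": -1.0},
    )
    for state, p in occ.items():
        print(f"{state}: {p:.3f}")
```

Because the states are mutually exclusive, raising the concentration or affinity of one factor necessarily lowers the occupancy of the others. A genome-scale model would instead have to sum over configurations of many binding sites along a sequence, which this sketch does not attempt.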
The C source code for compete and Python source code for MCMC-based inference are available at http://www.cs.duke.edu/~amink.
Supplementary data are available at Bioinformatics online.