Kumar Sunil, Bucher Philipp
Swiss Institute for Experimental Cancer Research (ISREC), School of Life Sciences, EPFL, Station 15, Lausanne, CH-1015, Switzerland.
Swiss Institute of Bioinformatics (SIB), EPFL, Station 15, Lausanne, CH-1015, Switzerland.
BMC Bioinformatics. 2016 Jan 11;17 Suppl 1(Suppl 1):4. doi: 10.1186/s12859-015-0846-z.
Understanding the mechanisms by which transcription factors (TFs) are recruited to their physiological target sites is crucial for understanding gene regulation. Sequence-intrinsic features of DNA, such as predicted binding affinity, are often poor predictors of in vivo site occupancy and, in any case, cannot explain cell-type-specific binding events. Recent reports show that chromatin accessibility, nucleosome occupancy and specific histone post-translational modifications greatly influence TF site occupancy in vivo. In this work, we use machine-learning methods to build predictive models and assess the relative importance of different sequence-intrinsic and chromatin features in the TF-to-target-site recruitment process.
Our study primarily relies on recent data published by the ENCODE consortium. Five dissimilar TFs assayed in multiple cell types were selected as examples: CTCF, JunD, REST, GABP and USF2. We used two types of candidate target sites: (a) predicted sites obtained by scanning the whole genome with a position weight matrix (PWM), and (b) cell-type-specific peak lists provided by ENCODE. Quantitative in vivo occupancy levels in different cell types were derived from ChIP-seq data for the corresponding TFs. In parallel, we computed a number of associated sequence-intrinsic and experimental features (histone modifications, DNase I hypersensitivity, etc.) for each site. Machine-learning algorithms were then used in a binary classification and regression framework to predict site occupancy and binding strength, with the aim of assessing the relative importance of different contextual features.
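As an illustration of how candidate sites of type (a) can be obtained, the following is a minimal sketch of PWM scanning on both strands. It is not the pipeline used in the study: the count matrix, pseudocount, uniform background, threshold and example sequence are all hypothetical values chosen for the example.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights.

    Assumes a uniform background; the pseudocount avoids log(0).
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def score_window(pwm, window):
    """Sum the position-specific weights of the bases in the window."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, strand, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = score_window(pwm, site)
            if score >= threshold:
                hits.append((start, strand, score))
    return hits


# Hypothetical 4-bp motif with consensus TGAC (toy counts, not a real TF matrix).
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 9},
    {"A": 0, "C": 0, "G": 9, "T": 0},
    {"A": 9, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 0},
]
toy_pwm = counts_to_pwm(toy_counts)
hits = scan(toy_pwm, "AATGACGGGTCAAA", threshold=6.0)
for start, strand, score in hits:
    print(start, strand, round(score, 2))
```

In a genome-wide setting, each hit would then be annotated with its PWM score and with the chromatin features measured around the site, giving one feature vector per candidate site.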
We observed striking differences in the feature importance rankings among the five factors tested. PWM scores were among the most important features only for CTCF and REST, and were of little value for JunD and USF2. Chromatin accessibility and active histone marks were potent predictors for all factors except REST. Structural DNA parameters and repressive and gene-body-associated histone marks were generally of little or no predictive value.
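To make the notion of a feature importance ranking concrete, the sketch below ranks features by how well each one alone separates bound from unbound sites, measured by ROC AUC. This univariate ranking is a simplified stand-in, not the measure used in the study, which derives importance from trained machine-learning models. The feature names and all values are synthetic and purely illustrative.

```python
def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity.

    Tied scores receive their average rank. Labels are 1 (bound) or 0 (unbound).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def rank_features(features, labels):
    """Sort features by |AUC - 0.5| so that inverse predictors also rank high."""
    scored = {name: auc(values, labels) for name, values in features.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1] - 0.5), reverse=True)


# Synthetic toy data: six candidate sites, the first three bound.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "dnase_signal": [9, 7, 8, 2, 3, 1],  # hypothetical accessibility signal
    "pwm_score": [5, 6, 4, 5, 3, 4],     # hypothetical motif score
    "dna_shape": [1, 2, 3, 1, 2, 3],     # hypothetical structural parameter
}
ranking = rank_features(features, labels)
for name, value in ranking:
    print(name, round(value, 3))
```

A univariate ranking of this kind ignores interactions and redundancy between features, which is one reason model-based importance measures are generally preferred for analyses like the one described here.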
We define a general and extensible computational framework for analyzing the importance of various DNA-intrinsic and chromatin-associated features in determining cell-type-specific TF binding to target sites. Applying our methodology to ENCODE data has yielded new insights into transcription regulatory processes and may serve as an example for future studies encompassing even larger datasets.