Computational Biology Program, Memorial Sloan-Kettering Cancer Center, New York, New York 10065, USA.
Genome Res. 2012 Sep;22(9):1723-34. doi: 10.1101/gr.127712.111.
Gene regulatory programs in distinct cell types are maintained in large part through the cell-type-specific binding of transcription factors (TFs). The determinants of TF binding include direct DNA sequence preferences, DNA sequence preferences of cofactors, and the local cell-dependent chromatin context. To explore the contribution of DNA sequence signal, histone modifications, and DNase accessibility to cell-type-specific binding, we analyzed 286 ChIP-seq experiments performed by the ENCODE Consortium. This analysis included experiments for 67 transcriptional regulators, 15 of which were profiled in both the GM12878 (lymphoblastoid) and K562 (erythroleukemic) human hematopoietic cell lines. To model TF-bound regions, we trained support vector machines (SVMs) that use flexible k-mer patterns to capture DNA sequence signals more accurately than traditional motif approaches. In addition, we trained SVM spatial chromatin signatures to model local histone modifications and DNase accessibility, obtaining significantly more accurate TF occupancy predictions than simpler approaches. Consistent with previous studies, we find that DNase accessibility can explain cell-line-specific binding for many factors. However, we also find that of the 10 factors with prominent cell-type-specific binding patterns, four display distinct cell-type-specific DNA sequence preferences according to our models. Moreover, for two factors we identify cell-specific binding sites that are accessible in both cell types but bound only in one. For these sites, cell-type-specific sequence models, rather than DNase accessibility, are better able to explain differential binding. Our results suggest that using a single motif for each TF and filtering for chromatin-accessible loci is not always sufficient to accurately account for cell-type-specific binding profiles.
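To make the idea of k-mer-based sequence features concrete, the following is a minimal sketch of an exact-match k-mer spectrum kernel. It is a simpler stand-in, not the flexible k-mer patterns described in the abstract, which the abstract does not specify in detail. The function names `kmer_counts` and `spectrum_kernel`, the choice of k, and the handling of ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq.

    Assumption for this sketch: windows containing anything other
    than A, C, G, or T (for example N) are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def spectrum_kernel(seq_a, seq_b, k):
    """Dot product of the k-mer count vectors of two sequences."""
    a = kmer_counts(seq_a, k)
    b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers absent from b, so only shared
    # k-mers contribute to the sum.
    return sum(n * b[kmer] for kmer, n in a.items())
```

A kernel matrix built from such pairwise values over bound and unbound regions could, in principle, be passed to any SVM implementation that accepts precomputed kernels. Allowing mismatches or gapped patterns would bring the feature map closer to the "flexible" patterns the abstract mentions.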