Zhou Tianyin, Shen Ning, Yang Lin, Abe Namiko, Horton John, Mann Richard S, Bussemaker Harmen J, Gordân Raluca, Rohs Remo
Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089;
Departments of Pharmacology and Cancer Biology and Center for Genomic and Computational Biology, Duke University, Durham, NC 27708;
Proc Natl Acad Sci U S A. 2015 Apr 14;112(15):4654-9. doi: 10.1073/pnas.1422023112. Epub 2015 Mar 9.
DNA binding specificities of transcription factors (TFs) are a key component of gene regulatory processes. Underlying mechanisms that explain the highly specific binding of TFs to their genomic target sites are poorly understood. A better understanding of TF-DNA binding requires the ability to quantitatively model TF binding to accessible DNA as its basic step, before additional in vivo components can be considered. Traditionally, these models were built based on nucleotide sequence. Here, we integrated 3D DNA shape information derived with a high-throughput approach into the modeling of TF binding specificities. Using support vector regression, we trained quantitative models of TF binding specificity based on protein binding microarray (PBM) data for 68 mammalian TFs. The evaluation of our models included cross-validation on specific PBM array designs, testing across different PBM array designs, and using PBM-trained models to predict relative binding affinities derived from in vitro selection combined with deep sequencing (SELEX-seq). Our results showed that shape-augmented models compared favorably to sequence-based models. Although both k-mer and DNA shape features can encode interdependencies between nucleotide positions of the binding site, using DNA shape features reduced the dimensionality of the feature space. In addition, analyzing the feature weights of DNA shape-augmented models uncovered TF family-specific structural readout mechanisms that were not revealed by the DNA sequence. As such, this work combines knowledge from structural biology and genomics, and suggests a new path toward understanding TF binding and genome function.
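The dimensionality point in the abstract can be illustrated with a small counting sketch. It is not the authors' feature encoding: it assumes one-hot k-mer indicators at every window position, and it assumes four shape parameters with one value per nucleotide position, a number the abstract does not state. The function names are hypothetical, and no regression model is trained here.

```python
from itertools import product

BASES = "ACGT"


def kmer_one_hot(seq, k):
    """One-hot encode each k-mer window of seq (4**k indicators per window)."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    features = []
    for start in range(len(seq) - k + 1):
        window = [0] * len(kmers)
        window[index[seq[start:start + k]]] = 1
        features.extend(window)
    return features


def kmer_feature_count(length, max_k):
    """Total indicators when all k-mers from k=1 up to max_k are encoded."""
    return sum(4 ** k * (length - k + 1) for k in range(1, max_k + 1))


def shape_augmented_feature_count(length, n_shape_params=4):
    """1-mer indicators plus shape values per position.

    Assumption: n_shape_params values at every position, ignoring any
    positions near the sequence ends where shape may be undefined.
    """
    return 4 * length + n_shape_params * length


if __name__ == "__main__":
    site_length = 10
    print("1-mer to 3-mer features:", kmer_feature_count(site_length, 3))
    print("1-mer + shape features:", shape_augmented_feature_count(site_length))
```

Under these assumptions, the k-mer count grows as 4**k per window, while the shape-augmented count grows linearly with the site length, which is the sense in which shape features can capture positional interdependencies in a smaller feature space.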