Department of Statistics, University of California Riverside, Riverside, CA 92521, USA.
Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
Bioinformatics. 2017 Oct 1;33(19):3003-3010. doi: 10.1093/bioinformatics/btx336.
Transcription factors (TFs) bind to specific DNA sequence motifs. Several lines of evidence suggest that TF-DNA binding is mediated in part by properties of the local DNA shape: the width of the minor groove, the relative orientations of adjacent base pairs, etc. Several methods have been developed to jointly account for DNA sequence and shape properties in predicting TF binding affinity. However, a limitation of these methods is that they typically require a training set of aligned TF binding sites.
We describe a sequence + shape kernel that leverages DNA sequence and shape information to better understand protein-DNA binding preference and affinity. This kernel extends an existing class of k-mer-based sequence kernels and builds on the recently described di-mismatch kernel. Using three in vitro benchmark datasets, derived from universal protein binding microarrays (uPBMs), genomic context PBMs (gcPBMs) and SELEX-seq data, we demonstrate that incorporating DNA shape information improves our ability to predict protein-DNA binding affinity. In particular, we observe that (i) the k-spectrum + shape model performs better than the classical k-spectrum kernel, particularly for small k values; (ii) the di-mismatch kernel performs better than the k-mer kernel for larger k values; and (iii) the di-mismatch + shape kernel performs better than the di-mismatch kernel for intermediate k values.
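To make the kernel terminology concrete, here is a minimal sketch of the classical k-spectrum kernel, which scores two sequences by the inner product of their k-mer count vectors. The `combined_kernel` function is a hypothetical illustration of adding a shape term, assuming each sequence comes with a precomputed numeric shape vector (for example, minor groove width values). It is not the formulation used in the paper or in the released software, and all function names are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared k-mers contribute.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def combined_kernel(seq_a, seq_b, shape_a, shape_b, k, weight=1.0):
    """Hypothetical sequence + shape kernel (an assumption, not the paper's definition).

    Adds a weighted linear kernel on shape feature vectors to the spectrum kernel.
    A non-negative weighted sum of two valid kernels is itself a valid kernel.
    """
    shape_term = sum(x * y for x, y in zip(shape_a, shape_b))
    return spectrum_kernel(seq_a, seq_b, k) + weight * shape_term
```

A kernel of this kind can be evaluated on all pairs of training sequences to build a Gram matrix, which is then passed to any kernel method that accepts a precomputed kernel.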
The software is available at https://bitbucket.org/wenxiu/sequence-shape.git.
Contact: rohs@usc.edu or william-noble@uw.edu.
Supplementary data are available at Bioinformatics online.