Department of Genetics and Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA.
BMC Bioinformatics. 2018 Mar 6;19(1):86. doi: 10.1186/s12859-018-2104-7.
Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site's activity. The independence assumption is known to be an approximation, often a good one but sometimes poor. Alternative approaches have been developed that use k-mers (DNA "words" of length k) to account for the non-independence, and more recently DNA structural parameters have been incorporated into the models. ChIP-seq data are often used to assess the discriminatory power of motifs and to compare different models. However, to measure the improvement due to using more complex models, one must compare to optimized matrix models.
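The independence assumption and its k-mer relaxation can be written out as a short sketch. The symbols here are illustrative assumptions, not notation taken from the paper: $x = x_1 \dots x_L$ is a candidate site of length $L$, $W$ is a $4 \times L$ weight matrix, and $D$ holds weights for adjacent base pairs.

```latex
% Additive (independent-position) matrix model: each position contributes on its own.
S_{\mathrm{PWM}}(x) = \sum_{i=1}^{L} W_{x_i,\, i}

% Di-nucleotide extension (k = 2): adds a term for each adjacent pair of positions,
% which lets the score depend on neighboring bases jointly.
S_{\mathrm{di}}(x) = \sum_{i=1}^{L} W_{x_i,\, i} + \sum_{i=1}^{L-1} D_{x_i x_{i+1},\, i}
```

When every entry of $D$ is zero, the second form reduces to the first, so the additive model is a special case of the di-nucleotide model.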
We describe a program "Discriminative Additive Model Optimization" (DAMO) that uses positive and negative examples, as in ChIP-seq data, and finds the additive position weight matrix (PWM) that maximizes the Area Under the Receiver Operating Characteristic Curve (AUROC). We compare to a recent study where structural parameters, serving as features in a gradient boosting classifier algorithm, are shown to improve the AUROC over JASPAR position frequency matrices (PFMs). In agreement with the previous results, we find that adding structural parameters gives the largest improvement, but most of the gain can be obtained by an optimized PWM and nearly all of the gain can be obtained with a di-nucleotide extension to the PWM.
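As a minimal sketch of the quantity being maximized, the following Python scores sequences with an additive PWM and computes the AUROC from positive and negative examples. It is not DAMO's implementation: the function names, the toy matrix, the example sequences, and the choice to score each sequence by its best forward-strand window are all assumptions made for illustration, and the optimization step itself is omitted.

```python
import numpy as np

BASES = "ACGT"  # row order assumed for the illustrative PWM below


def best_site_score(seq, pwm):
    """Highest additive PWM score over all windows of seq (forward strand only)."""
    width = pwm.shape[1]
    idx = np.array([BASES.index(b) for b in seq])
    # Additive model: the window score is the sum of one weight per position.
    scores = [pwm[idx[i:i + width], np.arange(width)].sum()
              for i in range(len(idx) - width + 1)]
    return max(scores)


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())


# Hypothetical 4 x 3 PWM whose preferred site is "ACG".
pwm = np.array([
    [ 1.0, -1.0, -1.0],  # A
    [-1.0,  1.0, -1.0],  # C
    [-1.0, -1.0,  1.0],  # G
    [-1.0, -1.0, -1.0],  # T
])

# Made-up stand-ins for ChIP-seq peak (positive) and background (negative) sequences.
positives = ["TTACGTT", "ACGAAAA"]
negatives = ["TTTTTTT", "AAATTTT"]

pos_scores = [best_site_score(s, pwm) for s in positives]
neg_scores = [best_site_score(s, pwm) for s in negatives]
result = auroc(pos_scores, neg_scores)
print(result)
```

An optimizer in this setting would adjust the entries of `pwm` to raise `result`; comparing models fairly then means comparing their AUROC values after each has been optimized on the same data.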
To compare different models for TF binding sites appropriately, optimized models must be used. PWMs and their extensions are good representations of binding specificity for most TFs, and more complex models, including those that incorporate DNA shape features and gradient boosting classifiers, provide only moderate improvements for a few TFs.