Department of Chemistry, University of Chicago, Chicago, IL 60637, USA.
Nucleic Acids Res. 2012 Dec;40(22):e175. doi: 10.1093/nar/gks771. Epub 2012 Aug 25.
Typical approaches for predicting transcription factor binding sites (TFBSs) use a position-specific weight matrix (PWM) to statistically characterize the sequences of known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. The new features are derived from Gibbs energies of amino acid-DNA interactions and from hydroxyl radical cleavage profiles of DNA. The algorithmic modifications are the inclusion of a feature selection step, the use of a nonlinear kernel in the SVM classifier, and the use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone, to separate the performance gains attributable to SVM-based models from those attributable to physicochemical features. The accuracy of each variant method was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
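The PWM baseline and the "letter features" mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. The function names (`build_pwm`, `score_window`, `scan`, `one_hot`), the pseudocount of 1, the uniform background and the one-hot encoding of letters are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log2-odds PWM from equal-length known sites.

    Assumptions (not from the abstract): additive pseudocount and,
    by default, a uniform background of 0.25 per base.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no probability is zero.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in BASES}
        )
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence):
    """Score every window of PWM width along the sequence."""
    width = len(pwm)
    return [
        (start, score_window(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


def one_hot(window):
    """Hypothetical letter-feature encoding: four indicators per position."""
    return [1.0 if base == b else 0.0 for base in window for b in BASES]
```

For a set of aligned sites, `scan(build_pwm(sites), sequence)` returns a list of `(start, score)` pairs, and the highest-scoring window is the PWM's best candidate site. A vector such as `one_hot(window)` is one plausible input to an SVM trained on letter features alone; the physicochemical variants described in the abstract would use different per-position features in its place.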