School of Computer Science and Engineering, Central South University, Changsha 410075, China.
School of Software, Xinjiang University, Urumqi 830008, China.
Molecules. 2019 Dec 26;25(1):98. doi: 10.3390/molecules25010098.
Interactions between proteins and DNA play essential roles in many biological processes. DNA-binding proteins can be classified into two categories. Double-stranded DNA-binding proteins (DSBs) bind to double-stranded DNA and are involved in a range of cell functions such as gene expression and regulation. Single-stranded DNA-binding proteins (SSBs) bind to single-stranded DNA and are necessary for DNA replication, recombination, and repair. Effective classification of DNA-binding proteins is therefore helpful for the functional annotation of proteins. In this work, we propose PredPSD, a computational method based on sequence information that accurately predicts SSBs and DSBs. It introduces three novel feature extraction algorithms. In particular, we use the autocross-covariance (ACC) transformation to convert feature matrices into fixed-length vectors. We then feed the optimal feature subset, obtained with the minimal-redundancy-maximal-relevance (mRMR) feature selection algorithm, into a gradient tree boosting (GTB) classifier. In 10-fold cross-validation on a benchmark dataset, PredPSD achieves promising performance, with an AUC of 0.956 and an accuracy of 0.912, both higher than those of existing methods. Moreover, our method significantly improves prediction accuracy in independent testing. The experimental results show that PredPSD can effectively recognize binding specificity and distinguish DSBs from SSBs.
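To illustrate how an ACC transformation can map a per-residue feature matrix of variable length L onto a fixed-length vector, the following is a minimal sketch in plain Python. It assumes the commonly used definition, in which the covariance between columns a and b at lag g is averaged over the L - g residue pairs; the exact variant, the lag range, and the feature matrices used by PredPSD are not stated in the abstract. The function name `acc_transform` and the parameter `max_lag` are hypothetical.

```python
def acc_transform(matrix, max_lag):
    """Map an L x N feature matrix to a vector of length N * N * max_lag.

    Assumed definition (not confirmed by the abstract): for each lag g in
    1..max_lag and each column pair (a, b), average the product of the
    centered values at residue j (column a) and residue j + g (column b).
    """
    length = len(matrix)
    n_cols = len(matrix[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")

    # Column means taken over all residues of the sequence.
    means = [sum(row[c] for row in matrix) / length for c in range(n_cols)]
    # Center every column once so the lag products are covariances.
    centered = [[row[c] - means[c] for c in range(n_cols)] for row in matrix]

    features = []
    for lag in range(1, max_lag + 1):
        span = length - lag  # number of residue pairs separated by this lag
        for a in range(n_cols):
            for b in range(n_cols):
                # a == b: auto-covariance term; a != b: cross-covariance term.
                total = sum(centered[j][a] * centered[j + lag][b]
                            for j in range(span))
                features.append(total / span)
    return features


# Toy matrices with 2 feature columns and different sequence lengths.
short_seq = [[1, 0], [2, 1], [3, 0], [4, 1]]
long_seq = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]

# Both vectors have the same length, independent of sequence length.
print(len(acc_transform(short_seq, max_lag=2)))
print(len(acc_transform(long_seq, max_lag=2)))
```

Because the output length depends only on the number of columns and the maximum lag, proteins of different lengths yield vectors of equal size, which is what allows a subsequent feature selection step and a classifier such as GTB to operate on them.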