Ma Xin, Guo Jing, Sun Xiao
School of Science, Nanjing Audit University, Nanjing, China.
State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.
PLoS One. 2016 Dec 1;11(12):e0167345. doi: 10.1371/journal.pone.0167345. eCollection 2016.
DNA-binding proteins are fundamentally important in cellular processes. Several computational methods have been developed in recent years to improve the prediction of DNA-binding proteins. However, insufficient work has been done on predicting DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using a random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect the conservation of the physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues together with the non-binding propensity of non-binding residues. Comparisons among the individual features showed that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during model construction. The DNABP model achieved 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. The high prediction accuracy and the performance comparisons with previous research suggest that DNABP could be a useful approach for identifying DNA-binding proteins from sequence information. The DNABP web server is freely available at http://www.cbi.seu.edu.cn/DNABP/.
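As a rough illustration of two ideas named in the abstract, the sketch below computes accuracy, sensitivity, specificity and the Matthews correlation coefficient from confusion-matrix counts, and runs a generic incremental feature selection loop over a pre-ranked feature list. It is not the authors' code. The function names, the `evaluate` callback (a stand-in for training and cross-validating a random forest on a feature subset) and the toy numbers are all assumptions, and the abstract does not state which criterion was used to pick the final subset.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard binary metrics; assumes both classes are present."""
    total = tp + fn + tn + fp
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        # MCC is conventionally set to 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def incremental_feature_selection(
    ranked: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Score the top-1, top-2, ... prefixes of a ranked feature list
    (for example, an mRMR ranking) and keep the best-scoring prefix."""
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = list(ranked[:k])
        score = evaluate(subset)         # hypothetical: e.g. cross-validated MCC
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy usage with made-up counts and a made-up scoring table.
metrics = binary_metrics(tp=80, fn=20, tn=90, fp=10)

toy_scores = {1: 0.50, 2: 0.70, 3: 0.65, 4: 0.60}
subset, score = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"],
    evaluate=lambda feats: toy_scores[len(feats)],
)
```

In practice the `evaluate` callback would train a classifier on the chosen features and return a cross-validated score. Passing it in as a parameter keeps the selection loop independent of any particular model or library.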