Liu Bin, Xu Jinghao, Fan Shixi, Xu Ruifeng, Zhou Jiyun, Wang Xiaolong
School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, P.R. China.
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, P.R. China.
Mol Inform. 2015 Jan;34(1):8-17. doi: 10.1002/minf.201400025. Epub 2014 Sep 26.
Identification of DNA-binding proteins is an important problem in biomedical research, as DNA-binding proteins are crucial for various cellular processes. Currently, machine learning methods using different features achieve state-of-the-art performance. A key step in improving the performance of these methods is finding a suitable representation of proteins. In this study, we proposed a feature vector composed of three kinds of sequence-based features: overall amino acid composition, pseudo amino acid composition (PseAAC) as proposed by Chou, and physicochemical distance transformation. These features not only consider the sequence composition of proteins but also incorporate the sequence-order information of their amino acids. The feature vectors were fed into a Support Vector Machine (SVM) for DNA-binding protein identification. The proposed method is called PseDNA-Pro. Experiments on stringent benchmark datasets and independent test datasets using the jackknife test showed that PseDNA-Pro can achieve an accuracy higher than 80%, outperforming several state-of-the-art methods, including DNAbinder, DNA-Prot, and iDNA-Prot. These results indicate that combining various features is a suitable approach for DNA-binding protein prediction, and that the sequence-order information among residues in proteins is relevant for discrimination. For practical applications, a web server for PseDNA-Pro was established, which is available at http://bioinformatics.hitsz.edu.cn/PseDNA-Pro/.
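To make the idea of combining composition with sequence-order information concrete, the following is a minimal sketch of a PseAAC-style feature vector, not the PseDNA-Pro implementation. It assumes a single physicochemical property (the Kyte-Doolittle hydropathy scale, used here only as a stand-in), and the function names and the parameters `lam` and `weight` are hypothetical choices for illustration; the abstract does not state which properties or parameter values were used.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, chosen here as an illustrative property
# (assumption: the abstract does not name the properties actually used).
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    mu = mean(scale.values())
    sigma = pstdev(scale.values())
    return {aa: (value - mu) / sigma for aa, value in scale.items()}


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def sequence_order_factors(seq, lam, prop):
    """Mean squared property difference between residues k positions apart."""
    factors = []
    for k in range(1, lam + 1):
        diffs = [(prop[seq[i]] - prop[seq[i + k]]) ** 2
                 for i in range(len(seq) - k)]
        factors.append(sum(diffs) / len(diffs))
    return factors


def pse_aac(seq, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector that sums to 1."""
    seq = seq.upper()
    if any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence contains a non-standard residue")
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    prop = standardize(KYTE_DOOLITTLE)
    aac = amino_acid_composition(seq)
    theta = sequence_order_factors(seq, lam, prop)
    denom = sum(aac) + weight * sum(theta)
    # First 20 entries carry composition; the last lam carry sequence order.
    return [f / denom for f in aac] + [weight * t / denom for t in theta]


if __name__ == "__main__":
    vector = pse_aac("MKVLAAGIVGLLLAQ", lam=3)
    print(len(vector))
```

Under these assumptions, two proteins with identical composition but different residue order share the same relative proportions among the first 20 components and differ in the last `lam`, which is the kind of sequence-order signal the abstract refers to. Vectors of this form could then be passed to any SVM implementation and evaluated with leave-one-out (jackknife) cross-validation.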