Liu Bin, Wang Shanyi, Dong Qiwen, Li Shumin, Liu Xuan
IEEE Trans Nanobioscience. 2016 Jun;15(4):328-334. doi: 10.1109/TNB.2016.2555951. Epub 2016 Apr 20.
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities, ranging from DNA replication to gene expression control. With the rapid development of next-generation sequencing techniques, the number of known protein sequences is increasing at an unprecedented rate. It is therefore necessary to develop computational methods that identify DNA-binding proteins from protein sequence information alone. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) with the auto-cross covariance transformation. The protein sequences are first converted into a profile-based protein representation, and then into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. This scheme effectively captures the sequence order effect. The vectors are then fed into an SVM to discriminate DNA-binding proteins from non-DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and a Matthews correlation coefficient of 0.5 in a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset show that iDNA-KACC-EL outperforms the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA-binding protein identification.
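
As an illustration of how an auto-cross covariance transformation can turn a variable-length profile into a fixed-length vector, the following is a minimal sketch and not the authors' implementation. It assumes the profile is an L x N matrix of scores (one row per residue, one column per profile dimension, for example N = 20 for a position-specific scoring matrix) and uses the commonly cited definition of auto covariance (same column) and cross covariance (different columns) at lags 1 to `max_lag`. The function name `acc_transform`, the parameter `max_lag`, and the toy profiles are hypothetical; the Kmer composition step and the SVM classifier mentioned in the abstract are not reproduced here.

```python
from typing import List


def acc_transform(profile: List[List[float]], max_lag: int) -> List[float]:
    """Map an L x N profile matrix to a vector of length N * N * max_lag.

    For each lag and each ordered column pair (c1, c2), the feature is the
    average product of mean-centred scores at positions j and j + lag.
    c1 == c2 gives an auto covariance term, c1 != c2 a cross covariance term.
    This is an assumed, generic formulation for illustration only.
    """
    length = len(profile)
    n_cols = len(profile[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must be at least 1 and smaller than the sequence length")

    # Column means over the whole sequence, used to centre the scores.
    means = [sum(row[c] for row in profile) / length for c in range(n_cols)]

    features: List[float] = []
    for lag in range(1, max_lag + 1):
        for c1 in range(n_cols):
            for c2 in range(n_cols):
                total = 0.0
                for j in range(length - lag):
                    total += (profile[j][c1] - means[c1]) * (
                        profile[j + lag][c2] - means[c2]
                    )
                features.append(total / (length - lag))
    return features


# Toy profiles with 2 columns and different lengths (made-up numbers).
short_profile = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
long_profile = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.6, 0.4]]

short_vec = acc_transform(short_profile, max_lag=2)
long_vec = acc_transform(long_profile, max_lag=2)
print(len(short_vec), len(long_vec))
```

Both toy profiles yield vectors of the same length regardless of sequence length, which is the property that lets such vectors be passed to a standard classifier such as an SVM.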