Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning.

Author information

Liu Bin, Wang Shanyi, Dong Qiwen, Li Shumin, Liu Xuan

Publication information

IEEE Trans Nanobioscience. 2016 Jun;15(4):328-334. doi: 10.1109/TNB.2016.2555951. Epub 2016 Apr 20.

Abstract

DNA-binding proteins play a pivotal role in various intra- and extracellular activities, ranging from DNA replication to gene expression control. With the rapid development of next-generation sequencing techniques, the number of protein sequences is increasing at an unprecedented rate. It is therefore necessary to develop computational methods that identify DNA-binding proteins from protein sequence information alone. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) with the auto-cross covariance transformation. The protein sequences are first converted into a profile-based protein representation and then into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. This scheme effectively captures the sequence-order effect. These vectors are then fed into the SVM to discriminate DNA-binding proteins from non-DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and a Matthews correlation coefficient of 0.5 in a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset show that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA-binding protein identification.
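
The abstract does not give the formula for the transformation, so the following is only a minimal sketch of how an auto-cross covariance (ACC) transformation can turn a variable-length profile into a fixed-length vector. It assumes the commonly used definition of ACC: for each lag, the mean product of centered scores in one profile column at position j and another column at position j + lag. The function name `acc_transform` and the parameters `profile` and `max_lag` are hypothetical, and the Kmer composition step, the SVM, and the ensemble described by the authors are not reproduced here.

```python
from typing import List


def acc_transform(profile: List[List[float]], max_lag: int) -> List[float]:
    """Sketch of an auto-cross covariance transformation (assumed definition).

    profile: one row per sequence position, one column per profile score
             (for example 20 columns, one per amino acid).
    max_lag: largest distance between the two positions being compared.
    Returns max_lag * n_cols * n_cols features, independent of sequence length.
    """
    length = len(profile)
    n_cols = len(profile[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")

    # Mean of each profile column over the whole sequence.
    means = [sum(row[c] for row in profile) / length for c in range(n_cols)]

    features: List[float] = []
    for lag in range(1, max_lag + 1):
        span = length - lag  # number of position pairs separated by this lag
        for c1 in range(n_cols):
            for c2 in range(n_cols):
                # c1 == c2 gives the auto covariance term,
                # c1 != c2 gives the cross covariance term.
                total = sum(
                    (profile[j][c1] - means[c1]) * (profile[j + lag][c2] - means[c2])
                    for j in range(span)
                )
                features.append(total / span)
    return features
```

Because the output length depends only on `max_lag` and the number of profile columns, sequences of different lengths map to vectors of the same size, which is what a classifier such as an SVM needs as input.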
