Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning.

Author information

Liu Bin, Wang Shanyi, Dong Qiwen, Li Shumin, Liu Xuan

Publication information

IEEE Trans Nanobioscience. 2016 Jun;15(4):328-334. doi: 10.1109/TNB.2016.2555951. Epub 2016 Apr 20.

Abstract

DNA-binding proteins play a pivotal role in various intra- and extracellular activities, ranging from DNA replication to gene expression control. With the rapid development of next-generation sequencing techniques, the number of known protein sequences is increasing at an unprecedented rate. It is therefore necessary to develop computational methods that identify DNA-binding proteins from protein sequence information alone. In this study, a novel method called iDNA-KACC is presented, which combines the support vector machine (SVM) with the auto-cross covariance transformation. Protein sequences are first converted into a profile-based protein representation and then into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. This scheme effectively captures the sequence-order effect. The vectors are then fed into an SVM to discriminate DNA-binding proteins from non-DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and a Matthews correlation coefficient of 0.5 in a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach; the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset show that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA-binding protein identification.
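
The abstract does not give the formula for the auto-cross covariance (ACC) transformation, so the following is only a minimal sketch of the commonly used form, under stated assumptions: the profile is an L x N matrix (one row per residue, one column per profile score), the auto covariance terms pair a column with itself at a given lag, and the cross covariance terms pair two different columns. The function name acc_transform and the parameter max_lag are hypothetical, and the Kmer composition step and the profile construction used by iDNA-KACC are not reproduced here. The sketch shows how proteins of different lengths map to vectors of the same length, which is what an SVM requires.

```python
import numpy as np


def acc_transform(profile, max_lag):
    """Map an (L, N) profile matrix to a vector of length N * N * max_lag.

    Illustrative ACC form (an assumption, not taken from the paper):
    for each lag, entry [i1, i2] is the average over positions j of
    (P[j, i1] - mean_i1) * (P[j + lag, i2] - mean_i2).
    Diagonal entries are auto covariance, off-diagonal ones cross covariance.
    """
    profile = np.asarray(profile, dtype=float)
    length, _ = profile.shape
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")

    # Center every profile column on its mean over the whole sequence.
    centered = profile - profile.mean(axis=0)

    features = []
    for lag in range(1, max_lag + 1):
        # Rows j of the first factor are paired with rows j + lag of the second.
        cov = centered[:-lag].T @ centered[lag:] / (length - lag)
        features.append(cov.ravel())
    return np.concatenate(features)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    short_protein = rng.random((50, 20))   # stand-in for a 50-residue profile
    long_protein = rng.random((120, 20))   # stand-in for a 120-residue profile
    print(acc_transform(short_protein, 3).shape)
    print(acc_transform(long_protein, 3).shape)
```

Both calls return vectors of the same length (20 x 20 x 3 values), regardless of sequence length. The random matrices are placeholders for real profiles and carry no biological meaning.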

Similar articles

Protein remote homology detection based on auto-cross covariance transformation.

Comput Biol Med. 2011 Aug;41(8):640-7. doi: 10.1016/j.compbiomed.2011.05.015. Epub 2011 Jun 12.

PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou's PseAAC and Physicochemical Distance Transformation.

Mol Inform. 2015 Jan;34(1):8-17. doi: 10.1002/minf.201400025. Epub 2014 Sep 26.

Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation.

BMC Syst Biol. 2015;9 Suppl 1(Suppl 1):S10. doi: 10.1186/1752-0509-9-S1-S10. Epub 2015 Feb 6.

Recombination Hotspot/Coldspot Identification Combining Three Different Pseudocomponents via an Ensemble Learning Approach.

Biomed Res Int. 2016;2016:8527435. doi: 10.1155/2016/8527435. Epub 2016 Aug 25.

A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation.

Bioinformatics. 2009 Oct 15;25(20):2655-62. doi: 10.1093/bioinformatics/btp500. Epub 2009 Aug 25.

newDNA-Prot: Prediction of DNA-binding proteins by employing support vector machine and a comprehensive sequence representation.

Comput Biol Chem. 2014 Oct;52:51-9. doi: 10.1016/j.compbiolchem.2014.09.002. Epub 2014 Sep 15.

gDNA-Prot: Predict DNA-binding proteins by employing support vector machine and a novel numerical characterization of protein sequence.

J Theor Biol. 2016 Oct 7;406:8-16. doi: 10.1016/j.jtbi.2016.06.002. Epub 2016 Jul 1.

Two multi-classification strategies used on SVM to predict protein structural classes by using auto covariance.

Interdiscip Sci. 2009 Dec;1(4):315-9. doi: 10.1007/s12539-009-0066-1. Epub 2009 Nov 14.

PSFM-DBT: Identifying DNA-Binding Proteins by Combing Position Specific Frequency Matrix and Distance-Bigram Transformation.

Int J Mol Sci. 2017 Aug 25;18(9):1856. doi: 10.3390/ijms18091856.

Cited by

ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins.

Protein Sci. 2024 Jun;33(6):e5015. doi: 10.1002/pro.5015.

A novel hybrid model to predict concomitant diseases for Hashimoto's thyroiditis.

BMC Bioinformatics. 2023 Aug 24;24(1):319. doi: 10.1186/s12859-023-05443-5.

Hybrid_DBP: Prediction of DNA-binding proteins using hybrid features and convolutional neural networks.

Front Pharmacol. 2022 Oct 10;13:1031759. doi: 10.3389/fphar.2022.1031759. eCollection 2022.

Identify DNA-Binding Proteins Through the Extreme Gradient Boosting Algorithm.

Front Genet. 2022 Jan 28;12:821996. doi: 10.3389/fgene.2021.821996. eCollection 2021.

KK-DBP: A Multi-Feature Fusion Method for DNA-Binding Protein Identification Based on Random Forest.

Front Genet. 2021 Nov 29;12:811158. doi: 10.3389/fgene.2021.811158. eCollection 2021.

SAResNet: self-attention residual network for predicting DNA-protein binding.

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab101.

Large-scale comparative review and assessment of computational methods for anti-cancer peptide identification.

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa312.

A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features.

Front Bioeng Biotechnol. 2020 May 5;8:285. doi: 10.3389/fbioe.2020.00285. eCollection 2020.

Predictions of Apoptosis Proteins by Integrating Different Features Based on Improving Pseudo-Position-Specific Scoring Matrix.

Biomed Res Int. 2020 Jan 14;2020:4071508. doi: 10.1155/2020/4071508. eCollection 2020.

PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method.

Biomed Res Int. 2020 Apr 13;2020:7297631. doi: 10.1155/2020/7297631. eCollection 2020.
