SecProMTB: Support Vector Machine-Based Classifier for Secretory Proteins Using Imbalanced Data Sets Applied to Mycobacterium tuberculosis.

Affiliations

College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China.

College of Computer and Information Engineering, Inner Mongolia Agricultural University, 010018, Hohhot, China.

Publication information

Proteomics. 2019 Sep;19(17):e1900007. doi: 10.1002/pmic.201900007. Epub 2019 Aug 8.

DOI:10.1002/pmic.201900007

PMID:31348610

Abstract

Secretory proteins of Mycobacterium tuberculosis have attracted growing attention because of their dominant immunogenicity and their role in pathogenesis. Because traditional biochemical experiments are expensive and time-consuming, this study constructs a support vector machine model, SecProMTB, to identify these proteins with a bioinformatic approach. First, an improved pseudo-amino acid composition (PseAAC) algorithm is used to extract features from all entities. Second, a novel imbalanced-data strategy is proposed and adopted to divide the original data set into a training set and a test set. Third, to overcome overfitting, feature-ranking algorithms are applied together with incremental feature selection. Finally, the model is trained and optimized. The resulting model achieves an area under the curve of 0.862 and an average accuracy of 86% in the independent test. For the convenience of users, SecProMTB and related data are openly accessible at http://server.malab.cn/SecProMTB/index.jsp.
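The feature-ranking and incremental feature selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the ranking algorithms, so a Fisher-style F-score is assumed here; plain amino acid composition stands in for the improved PseAAC; and a nearest-centroid rule stands in for the support vector machine so that the example needs only the Python standard library. All function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Plain amino acid composition (assumed stand-in for the improved PseAAC)."""
    seq = seq.upper()
    total = sum(seq.count(a) for a in AMINO_ACIDS) or 1
    return [seq.count(a) / total for a in AMINO_ACIDS]

def f_score(pos, neg):
    """Fisher-style score: between-class separation over within-class variance."""
    m = mean(pos + neg)
    num = (mean(pos) - m) ** 2 + (mean(neg) - m) ** 2
    den = pvariance(pos) + pvariance(neg)
    return num / (den + 1e-12)  # small constant avoids division by zero

def rank_features(X, y):
    """Return feature indices sorted from most to least discriminative."""
    scores = []
    for j in range(len(X[0])):
        pos = [x[j] for x, t in zip(X, y) if t == 1]
        neg = [x[j] for x, t in zip(X, y) if t == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, which is less misleading on imbalanced data."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return mean(recalls)

def evaluate(X_tr, y_tr, X_val, y_val, feats):
    """Nearest-centroid classifier on the chosen features (stand-in for the SVM)."""
    cents = {}
    for cls in (0, 1):
        rows = [[x[j] for j in feats] for x, t in zip(X_tr, y_tr) if t == cls]
        cents[cls] = [mean(col) for col in zip(*rows)]
    preds = []
    for x in X_val:
        v = [x[j] for j in feats]
        dist = {c: sum((a - b) ** 2 for a, b in zip(v, cents[c])) for c in cents}
        preds.append(min(dist, key=dist.get))
    return balanced_accuracy(y_val, preds)

def incremental_feature_selection(X_tr, y_tr, X_val, y_val):
    """Add ranked features one at a time; keep the smallest best-scoring prefix."""
    order = rank_features(X_tr, y_tr)
    best_k, best_score = 0, -1.0
    for k in range(1, len(order) + 1):
        score = evaluate(X_tr, y_tr, X_val, y_val, order[:k])
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k], best_score

# Toy, imbalanced data (2 positives, 4 negatives); feature 0 is informative,
# feature 1 is noisy, feature 2 is constant.
X_train = [[0.9, 0.5, 0.1], [0.8, 0.1, 0.1],
           [0.2, 0.4, 0.1], [0.1, 0.2, 0.1], [0.3, 0.6, 0.1], [0.2, 0.3, 0.1]]
y_train = [1, 1, 0, 0, 0, 0]
X_val = [[0.85, 0.2, 0.1], [0.15, 0.5, 0.1], [0.25, 0.3, 0.1]]
y_val = [1, 0, 0]

selected, score = incremental_feature_selection(X_train, y_train, X_val, y_val)
print(selected, score)
```

In a real pipeline the rows of `X_train` would come from a feature extractor such as `aac`, and the subset size would more likely be chosen by cross-validation on the training set than by a single held-out split.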

Similar articles

DPP-PseAAC: A DNA-binding protein prediction model using Chou's general PseAAC.

J Theor Biol. 2018 Sep 7;452:22-34. doi: 10.1016/j.jtbi.2018.05.006. Epub 2018 May 16.

DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation.

Sci Rep. 2015 Oct 20;5:15479. doi: 10.1038/srep15479.

Identify Gram-negative bacterial secreted protein types by incorporating different modes of PSSM into Chou's general PseAAC via Kullback-Leibler divergence.

J Theor Biol. 2018 Oct 7;454:22-29. doi: 10.1016/j.jtbi.2018.05.035. Epub 2018 May 29.

Prediction of subcellular location of mycobacterial protein using feature selection techniques.

Mol Divers. 2010 Nov;14(4):667-71. doi: 10.1007/s11030-009-9205-1. Epub 2009 Nov 12.

predCar-site: Carbonylation sites prediction in proteins using support vector machine with resolving data imbalanced issue.

Anal Biochem. 2017 May 15;525:107-113. doi: 10.1016/j.ab.2017.03.008. Epub 2017 Mar 9.

CWLy-pred: A novel cell wall lytic enzyme identifier based on an improved MRMD feature selection method.

Genomics. 2020 Nov;112(6):4715-4721. doi: 10.1016/j.ygeno.2020.08.015. Epub 2020 Aug 19.

RFAmyloid: A Web Server for Predicting Amyloid Proteins.

Int J Mol Sci. 2018 Jul 16;19(7):2071. doi: 10.3390/ijms19072071.

Prediction of essential proteins in prokaryotes by incorporating various physico-chemical features into the general form of Chou's pseudo amino acid composition.

Protein Pept Lett. 2013 Jul 1;20(7):781-95. doi: 10.2174/0929866511320070008.

Effective DNA binding protein prediction by using key features via Chou's general PseAAC.

J Theor Biol. 2019 Jan 7;460:64-78. doi: 10.1016/j.jtbi.2018.10.027. Epub 2018 Oct 11.

Cited by

Predicting promoters based on novel feature descriptor and feature selection technique.

Front Microbiol. 2023 Mar 2;14:1141227. doi: 10.3389/fmicb.2023.1141227. eCollection 2023.

A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins.

Front Genet. 2022 Nov 23;13:935717. doi: 10.3389/fgene.2022.935717. eCollection 2022.

A SNARE Protein Identification Method Based on iLearnPlus to Efficiently Solve the Data Imbalance Problem.

Front Genet. 2022 Jan 28;12:818841. doi: 10.3389/fgene.2021.818841. eCollection 2021.

iTTCA-RF: a random forest predictor for tumor T cell antigens.

J Transl Med. 2021 Oct 27;19(1):449. doi: 10.1186/s12967-021-03084-x.

Recognizing Pattern and Rule of Mutation Signatures Corresponding to Cancer Types.

Front Cell Dev Biol. 2021 Aug 26;9:712931. doi: 10.3389/fcell.2021.712931. eCollection 2021.

iPMI: Machine Learning-Aided Identification of Parametrial Invasion in Women with Early-Stage Cervical Cancer.

Diagnostics (Basel). 2021 Aug 12;11(8):1454. doi: 10.3390/diagnostics11081454.

ACP-DA: Improving the Prediction of Anticancer Peptides Using Data Augmentation.

Front Genet. 2021 Jun 30;12:698477. doi: 10.3389/fgene.2021.698477. eCollection 2021.

Accurate identification of RNA D modification using multiple features.

RNA Biol. 2021 Dec;18(12):2236-2246. doi: 10.1080/15476286.2021.1898160. Epub 2021 Mar 17.

BOW-GBDT: A GBDT Classifier Combining With Artificial Neural Network for Identifying GPCR-Drug Interaction Based on Wordbook Learning From Sequences.

Front Cell Dev Biol. 2021 Feb 1;8:623858. doi: 10.3389/fcell.2020.623858. eCollection 2020.

Predicting Preference of Transcription Factors for Methylated DNA Using Sequence Information.

Mol Ther Nucleic Acids. 2020 Jul 31;22:1043-1050. doi: 10.1016/j.omtn.2020.07.035. eCollection 2020 Dec 4.
