College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China.
College of Computer and Information Engineering, Inner Mongolia Agricultural University, 010018, Hohhot, China.
Proteomics. 2019 Sep;19(17):e1900007. doi: 10.1002/pmic.201900007. Epub 2019 Aug 8.
Secretory proteins of Mycobacterium tuberculosis have attracted growing attention because of their dominant immunogenicity and their role in pathogenesis. Because traditional biochemical experiments are expensive and time-consuming, this study constructs a support vector machine model named SecProMTB to identify these proteins with a bioinformatic approach. First, an improved pseudo-amino acid composition (PseAAC) algorithm is used to extract features from all protein sequences. Second, a novel imbalanced-data strategy is proposed and adopted to divide the original data set into a training set and a test set. Third, to overcome the overfitting problem, feature-ranking algorithms are applied together with incremental feature selection. Finally, the model is trained and optimized. The resulting model achieves an area under the curve of 0.862 and an average accuracy of 86% in the independent test. For the convenience of users, SecProMTB and related data are openly accessible at http://server.malab.cn/SecProMTB/index.jsp.
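The abstract does not describe how the improved PseAAC algorithm differs from the standard one, so the following is only a minimal sketch of the classic type-1 PseAAC encoding, not the SecProMTB feature extractor. It assumes a single physicochemical property (the Kyte-Doolittle hydropathy index); the function name `pseaac` and the parameters `lam` (number of sequence-order tiers) and `weight` are hypothetical and chosen for illustration.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (assumed property; the paper's choice is not stated).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(scale):
    """Rescale a property table to zero mean and unit standard deviation."""
    mu = mean(scale.values())
    sigma = pstdev(scale.values())
    return {aa: (value - mu) / sigma for aa, value in scale.items()}


def pseaac(sequence, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional classic PseAAC vector for one protein."""
    # Keep only the 20 standard residues; anything else is silently dropped.
    seq = [aa for aa in sequence.upper() if aa in HYDROPATHY]
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")

    prop = _standardize(HYDROPATHY)

    # Plain amino acid composition: frequency of each of the 20 residues.
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

    # Sequence-order correlation factors for tiers j = 1..lam:
    # mean squared property difference between residues j positions apart.
    thetas = []
    for j in range(1, lam + 1):
        diffs = [(prop[a] - prop[b]) ** 2 for a, b in zip(seq, seq[j:])]
        thetas.append(mean(diffs))

    # Joint normalization so that all components sum to 1.
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Calling `pseaac("MKTAYIAKQRQISFVKSHFSRQ", lam=3)` on this arbitrary example sequence yields a 23-dimensional vector: 20 composition terms followed by 3 sequence-order terms. Vectors of this kind could then be passed to a feature-ranking step and a support vector machine, as the abstract outlines.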