Qian Lili, Wen Yaping, Han Guosheng
Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China.
Front Genet. 2020 Apr 3;11:275. doi: 10.3389/fgene.2020.00275. eCollection 2020.
Cancerlectins play an important role in the initiation, survival, growth, metastasis, and spread of cancer. Studying cancerlectin function is therefore of great significance, because it can help identify tumor markers and support tumor prevention, treatment, and prognosis. However, numerous studies have generated a large amount of protein data, and traditional prediction methods can no longer meet the needs of analysis. Developing powerful computational models that use these data to discriminate cancerlectins from non-cancerlectins on a large scale has become one of the most important topics in this area. In this study, we developed a feature extraction method for identifying cancerlectins based on the fusion of g-gap dipeptides. Analysis of variance was used to select the optimal feature set, and a support vector machine was used to classify the data. Rigorous nested 10-fold cross-validation showed that our method achieved a prediction accuracy of 83.91% and a sensitivity of 83.15%. In addition, to further evaluate the performance of the classification model constructed in this work, we built a new data set, on which the prediction accuracy reached 83.3%. Experimental results show that our method outperforms state-of-the-art methods.
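The feature construction described above can be illustrated with a minimal sketch. It assumes the usual definition of g-gap dipeptide composition (the frequency of each of the 400 residue pairs separated by g intervening positions) and a generic one-way ANOVA F statistic for scoring features. The function names, the choice of gaps (0, 1, 2), and the handling of non-standard residues are illustrative assumptions and are not taken from the paper, which does not state these details in the abstract.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def g_gap_dipeptide_composition(sequence, g):
    """Return the 400 frequencies of residue pairs separated by g positions.

    g = 0 gives the ordinary (adjacent) dipeptide composition.
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def fused_g_gap_features(sequence, gaps=(0, 1, 2)):
    """Concatenate the g-gap compositions for several gaps into one vector.

    The default gaps are a hypothetical choice made for illustration.
    """
    features = []
    for g in gaps:
        features.extend(g_gap_dipeptide_composition(sequence, g))
    return features


def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0  # F is undefined without two classes and spare samples
    grand_mean = sum(values) / n
    means = {label: sum(grp) / len(grp) for label, grp in groups.items()}
    between = sum(len(grp) * (means[label] - grand_mean) ** 2
                  for label, grp in groups.items())
    within = sum((value - means[label]) ** 2
                 for label, grp in groups.items() for value in grp)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))
```

In a pipeline of this kind, each protein sequence would be mapped to a fused vector, each feature scored with the F statistic, and the top-ranked features passed to a support vector machine. The classifier and the nested cross-validation loop are omitted here.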