Zhang Zixiao, Gong Yue, Gao Bo, Li Hongfei, Gao Wentao, Zhao Yuming, Dong Benzhi
College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China.
Front Genet. 2021 Dec 20;12:809001. doi: 10.3389/fgene.2021.809001. eCollection 2021.
Soluble N-ethylmaleimide-sensitive factor attachment protein receptor (SNARE) proteins are a large family of transmembrane proteins located in organelles and vesicles. Their important roles include initiating vesicle fusion and mediating the activation and fusion of proteins during exocytosis; they are also vital for regulating the transport of membrane proteins and non-regulatory vesicles. A method that identifies SNARE proteins efficiently is therefore of great significance. However, the identification accuracy of existing methods such as SNARE-CNN is not satisfactory. In this study, we developed a support vector machine (SVM) based method that can effectively recognize SNARE proteins. We used the position-specific scoring matrix (PSSM) to extract features from SNARE protein sequences, ranked feature importance with the support vector machine recursive feature elimination with correlation bias reduction (SVM-RFE-CBR) algorithm, and then selected the optimal feature subset from the ranked results. The selected features were used to build the model, which was trained with 10-fold cross-validation and evaluated on an independent dataset. In the independent test, our method achieved a sensitivity of 68%, a specificity of 94%, an accuracy of 92%, an area under the curve (AUC) of 84%, and a Matthews correlation coefficient (MCC) of 0.48. These results indicate that our method performs better on the common evaluation metrics than other existing classification methods for identifying SNARE proteins.
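The evaluation metrics reported above (sensitivity, specificity, accuracy, and MCC) follow from the four confusion-matrix counts of a binary classifier. The sketch below uses their standard textbook definitions; it is not the authors' code. The function names `confusion_counts` and `binary_metrics` and the example counts are hypothetical, chosen only for illustration and not taken from the paper's dataset.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels where 1 = SNARE and 0 = non-SNARE."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts."""
    # Sensitivity: fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # MCC is defined as 0 here when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not from the paper).
    for name, value in binary_metrics(tp=8, fp=2, tn=18, fn=2).items():
        print(f"{name}: {value:.3f}")
```

Because accuracy is dominated by the majority class when negatives far outnumber positives, MCC and sensitivity are usually the more informative figures on imbalanced test sets, which may explain why a 92% accuracy can coexist with a 68% sensitivity.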