School of Information Science and Engineering, Yunnan University, Kunming 650504, China.
Biomed Res Int. 2020 Jan 14;2020:4071508. doi: 10.1155/2020/4071508. eCollection 2020.
Apoptosis proteins are strongly associated with many diseases and play an indispensable role in maintaining the dynamic balance between cell death and division. Obtaining localization information on apoptosis proteins is necessary for understanding their function. To date, few researchers have addressed the imbalance of apoptosis protein data before classification, even though such imbalance readily leads to misclassification. Therefore, in this work, we introduce a method to resolve this problem and enhance prediction accuracy. First, features of the protein sequence are extracted from the position-specific scoring matrix (PSSM) by combining the Improved Pseudo-Position-Specific Scoring Matrix (IM-Psepssm) with the Bidirectional Correlation Coefficient (Bid-CC) algorithm. Second, different feature fusion and resampling strategies are used to reduce the impact of class imbalance in apoptosis protein datasets. Finally, the resulting feature vectors are used to train a Support Vector Machine (SVM) classification model, and prediction accuracy is evaluated by jackknife cross-validation tests. The experimental results indicate that, with the same feature vector, adopting resampling methods markedly improves many key indicators relative to the method without resampling when predicting the localization of apoptosis proteins in the ZD98, ZW225, and CL317 datasets. Additionally, we present new user-friendly local software for readers to apply; the code and software can be freely accessed at https://github.com/ruanxiaoli/Im-Psepssm.
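The evaluation scheme described in the abstract, resampling an imbalanced training set and scoring by jackknife (leave-one-out) tests, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: random oversampling stands in for whichever resampling strategies the paper uses, a nearest-centroid classifier stands in for the SVM so that the example needs no third-party packages, and the function names, toy feature vectors, and location labels are all hypothetical. The one point it is meant to show is that resampling is applied only to the training portion of each jackknife round, so the held-out protein never influences the balanced set.

```python
import random
from collections import Counter


def oversample(X, y, rng):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_bal.append(row)
            y_bal.append(label)
    return X_bal, y_bal


def fit_centroids(X, y):
    """Stand-in classifier (assumption: replaces the SVM): one mean vector per class."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)),
    )


def jackknife_accuracy(X, y, seed=0):
    """Leave each sample out once; resample only the remaining training samples."""
    rng = random.Random(seed)
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        X_bal, y_bal = oversample(X_train, y_train, rng)
        model = fit_centroids(X_bal, y_bal)
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)


# Hypothetical toy data: six majority-class and two minority-class feature vectors.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0],
     [0.1, 0.1], [0.2, 0.2], [5.0, 5.1], [5.2, 4.9]]
y = ["cyto"] * 6 + ["mito"] * 2

print(jackknife_accuracy(X, y))
```

With real data, the rows of `X` would be the fused feature vectors derived from each protein's PSSM and `y` the subcellular location labels. Per-class metrics such as sensitivity are usually more informative than overall accuracy on imbalanced sets.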