College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China.
College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China.
J Mol Graph Model. 2021 Sep;107:107962. doi: 10.1016/j.jmgm.2021.107962. Epub 2021 Jun 15.
Ubiquitination is a common, reversible post-translational protein modification that regulates apoptosis and plays an important role in protein degradation and in disease. Experimental identification of protein ubiquitination sites is usually time-consuming and labor-intensive, so effective computational predictors are needed. In this study, we propose UbiSite-XGBoost, a ubiquitination site prediction method based on multi-view features. First, seven single-view feature encoding methods convert protein sequence fragments into numerical vectors. Second, the least absolute shrinkage and selection operator (LASSO) removes redundant information and yields the optimal feature subsets. Finally, the selected features are fed into an eXtreme gradient boosting (XGBoost) classifier to predict ubiquitination sites. Under five-fold cross-validation, the AUC values on the Set1-Set6 datasets are 0.8258, 0.7592, 0.7853, 0.8345, 0.8979 and 0.8901, respectively. When the synthetic minority oversampling technique (SMOTE) is applied to the unbalanced datasets Set4-Set6, the AUC values are 0.9777, 0.9782 and 0.9860, respectively. In addition, we constructed three independent test datasets, on which the AUC values are 0.8007, 0.6897 and 0.7280, respectively. The results show that UbiSite-XGBoost outperforms other ubiquitination site prediction methods and provides new guidance for the identification of ubiquitination sites. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/UbiSite-XGBoost/.
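The first step of the pipeline, turning sequence fragments into numbers, can be illustrated with a minimal sketch. The abstract does not name the seven encodings, so the one-hot (binary) scheme below is only a generic stand-in and not necessarily one of them. The window size `flank`, the padding symbol `PAD`, and the function names `lysine_windows` and `one_hot` are hypothetical choices made for illustration. The sketch assumes candidate sites are lysine (K) residues, the residue to which ubiquitin is typically attached.

```python
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # hypothetical padding symbol for positions beyond the sequence ends


def lysine_windows(sequence: str, flank: int = 3) -> List[Tuple[int, str]]:
    """Return (position, fragment) for every lysine, each fragment 2*flank+1 long."""
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + flank in the padded string,
            # so the slice starting at i is centered on it.
            windows.append((i, padded[i : i + 2 * flank + 1]))
    return windows


def one_hot(fragment: str) -> List[int]:
    """Encode each residue as a 20-dim indicator vector; padding maps to all zeros."""
    vector: List[int] = []
    for residue in fragment:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


if __name__ == "__main__":
    # Toy sequence, not taken from the paper's datasets.
    for position, fragment in lysine_windows("MKTAYK", flank=2):
        features = one_hot(fragment)
        print(position, fragment, len(features))
```

Each fragment becomes a fixed-length vector (here 5 residues × 20 = 100 features). Vectors of this kind, concatenated across several encodings, are what a feature selector such as LASSO and a classifier such as XGBoost would then consume.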