Zhu Yan, Yin Shuwan, Zheng Jia, Shi Yixia, Jia Cangzhi
School of Science, Dalian Maritime University, Dalian 116026, P. R. China.
School of Mathematics and Statistics, Lingnan Normal University, Zhanjiang 524048, P. R. China.
J Bioinform Comput Biol. 2022 Feb;20(1):2150029. doi: 10.1142/S0219720021500293. Epub 2021 Nov 19.
O-glycosylation is a protein posttranslational modification that plays an important regulatory role in almost all cells and is associated with a large number of physiological and pathological phenomena. Recognizing O-glycosylation sites is key to further investigating the molecular mechanism of this modification. This study aimed to collect a reliable O-glycosylation dataset and to develop an O-glycosylation site predictor, named Captor, built on multiple features. Random undersampling and the synthetic minority oversampling technique (SMOTE) were employed to deal with the imbalanced data. In addition, the Kruskal-Wallis (K-W) test was adopted to optimize the feature vectors and improve the performance of the model. After a comprehensive comparison of classifiers from traditional machine learning and deep learning, a support vector machine showed the best performance and was used to train and optimize the final prediction model. On the independent test set, Captor outperformed the existing O-glycosylation tool, suggesting that Captor could provide more useful guidance for further experimental research on O-glycosylation. The source code and datasets are available at https://github.com/YanZhu06/Captor/.
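As a rough illustration of how a Kruskal-Wallis test can be used to rank features before training a classifier, the following minimal sketch scores each feature by how differently it is distributed in positive and negative sites. It is not taken from the Captor repository: the function names, the toy data, and the choice to sort by p-value are assumptions made here for illustration, and the abstract does not state the selection threshold the authors used.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(pos: Sequence[float],
                              neg: Sequence[float]) -> Tuple[float, float]:
    """Tie-corrected H statistic and p-value for two groups (1 degree of freedom)."""
    pooled = list(pos) + list(neg)
    n = len(pooled)
    ranks = average_ranks(pooled)
    r_pos = sum(ranks[:len(pos)])
    r_neg = sum(ranks[len(pos):])
    h = 12.0 / (n * (n + 1)) * (r_pos ** 2 / len(pos) + r_neg ** 2 / len(neg)) - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: the feature carries no signal
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


def rank_features(X: Sequence[Sequence[float]],
                  y: Sequence[int]) -> List[Tuple[int, float, float]]:
    """Return (feature index, H, p) tuples sorted by ascending p-value."""
    results = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        h, p = kruskal_wallis_two_groups(pos, neg)
        results.append((j, h, p))
    return sorted(results, key=lambda item: item[2])


if __name__ == "__main__":
    # Toy data (hypothetical): feature 0 separates the classes, feature 1 is constant.
    X = [[1, 5], [2, 5], [3, 5], [4, 5], [5, 5], [6, 5]]
    y = [1, 1, 1, 0, 0, 0]
    for index, h, p in rank_features(X, y):
        print(f"feature {index}: H={h:.3f}, p={p:.4f}")
```

In a pipeline like the one described, the top-ranked features would then be passed to the classifier, with class balancing applied to the training data only so that the independent test set stays untouched.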