Khalid Ayesha, Kaleem Afshan, Qazi Wajahat, Abdullah Roheena, Iqtedar Mehwish, Naz Shagufta
Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan.
Department of Computer Science, COMSATS University, Islamabad, Pakistan.
PLoS One. 2024 Dec 31;19(12):e0316215. doi: 10.1371/journal.pone.0316215. eCollection 2024.
Protein glycosylation, a vital post-translational modification, plays a pivotal role in many biological processes and in disease pathogenesis. Computational approaches, including protein language models and machine learning algorithms, have emerged as valuable tools for predicting O-linked N-acetylglucosamine (O-GlcNAc) sites, reducing experimental costs and improving efficiency. However, prediction of O-GlcNAc sites with the evolutionary scale model (ESM) has not been reported in the literature. This study therefore employed the ESM-2 model to predict O-GlcNAc sites in human proteins. Approximately 1100 O-linked glycoprotein sequences retrieved from the O-GlcNAc database were used for model training. The ESM-2 model improved consistently over epochs, achieving an accuracy of 78.30%, recall of 78.30%, precision of 61.31%, and F1-score of 68.74%. Traditional models, by contrast, overfit the same data (up to 99%), so the ESM-2 model performed better in terms of balanced training and testing predictions. These findings underscore the effectiveness of the ESM-2 model in predicting O-GlcNAc sites within human proteins. Accurate prediction of these sites can significantly advance glycoproteomic research by improving our understanding of protein function and disease mechanisms, aiding the development of targeted therapies, and facilitating biomarker discovery for better diagnosis and treatment. Future studies should use more diverse data types, longer protein sequences, and greater computational resources to evaluate various parameters. Accurate prediction of O-GlcNAc sites might also enhance the investigation of site-specific protein functions in physiology and disease.
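The abstract does not state how the four reported metrics were averaged across classes. As a minimal sketch, assuming support-weighted averaging (an assumption, not something the source confirms), the Python below computes accuracy, precision, recall, and F1 from per-residue labels. The function name `weighted_metrics` and the toy labels are hypothetical and are not data from the study. Under this averaging scheme, weighted recall is always equal to accuracy, which would be consistent with the identical 78.30% values reported above.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0  # per-class precision
        r_c = tp / count                            # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n                          # class share of all examples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels (1 = O-GlcNAc site, 0 = non-site); not study data.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

Because each class's recall is weighted by its share of the true labels, the weighted sum reduces to the fraction of correct predictions, so the two figures coincide regardless of the data.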