Prediction of human O-linked glycosylation sites using stacked generalization and embeddings from pre-trained protein language model.

Affiliations

Department of Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, United States.

School of Computing, Wichita State University, Wichita, KS 67260, United States.

Publication information

Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae643.

DOI:10.1093/bioinformatics/btae643

PMID:39447059

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11552629/

Abstract

MOTIVATION

O-linked glycosylation, an essential post-translational modification process in Homo sapiens, involves attaching sugar moieties to the oxygen atoms of serine and/or threonine residues. It influences various biological and cellular functions. While threonine or serine residues within protein sequences are potential sites for O-linked glycosylation, not all serine and/or threonine residues undergo this modification, underscoring the importance of characterizing its occurrence. This study presents a novel approach for predicting intracellular and extracellular O-linked glycosylation events on proteins, which are crucial for comprehending cellular processes. Two base multi-layer perceptron models were trained by leveraging a stacked generalization framework. These base models respectively use ProtT5 and Ankh O-linked glycosylation site-specific embeddings whose combined predictions are used to train the meta-multi-layer perceptron model. Trained on extensive O-linked glycosylation datasets, the stacked-generalization model demonstrated high predictive performance on independent test datasets. Furthermore, the study emphasizes the distinction between nucleocytoplasmic and extracellular O-linked glycosylation, offering insights into their functional implications that were overlooked in previous studies. By integrating the protein language model's embedding with stacked generalization techniques, this approach enhances predictive accuracy of O-linked glycosylation events and illuminates the intricate roles of O-linked glycosylation in proteomics, potentially accelerating the discovery of novel glycosylation sites.
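The stacking scheme described above (two base models, one per embedding type, feeding a meta-model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: logistic regression stands in for the multi-layer perceptrons so the example needs only the standard library, the two "views" are synthetic stand-ins for ProtT5 and Ankh site embeddings, and the held-out split used to train the meta-model is an assumed design choice that the abstract does not specify. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logreg(X, y, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(model, X):
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def make_view(labels, dim, shift, rng):
    # Synthetic stand-in for per-site embeddings (assumption, not real data):
    # features of positive sites are shifted by `shift`.
    return [[rng.gauss(shift * y, 1.0) for _ in range(dim)] for y in labels]

def train_stack(view_a, view_b, y, split):
    """Train one base model per view on the first `split` sites, then train the
    meta-model on the base models' predictions for the remaining sites."""
    base_a = fit_logreg(view_a[:split], y[:split])
    base_b = fit_logreg(view_b[:split], y[:split])
    meta_X = list(zip(predict_proba(base_a, view_a[split:]),
                      predict_proba(base_b, view_b[split:])))
    meta = fit_logreg(meta_X, y[split:])
    return base_a, base_b, meta

def stack_predict(models, view_a, view_b):
    base_a, base_b, meta = models
    meta_X = list(zip(predict_proba(base_a, view_a),
                      predict_proba(base_b, view_b)))
    return predict_proba(meta, meta_X)

rng = random.Random(0)
y_train = [rng.randint(0, 1) for _ in range(200)]
y_test = [rng.randint(0, 1) for _ in range(100)]
# View A plays the role of ProtT5 embeddings, view B the role of Ankh embeddings.
train_a, train_b = make_view(y_train, 8, 1.0, rng), make_view(y_train, 8, 1.0, rng)
test_a, test_b = make_view(y_test, 8, 1.0, rng), make_view(y_test, 8, 1.0, rng)

models = train_stack(train_a, train_b, y_train, split=120)
probs = stack_predict(models, test_a, test_b)
accuracy = sum((p >= 0.5) == bool(t) for p, t in zip(probs, y_test)) / len(y_test)
print(f"stacked accuracy on synthetic test sites: {accuracy:.2f}")
```

Training the meta-model on sites the base models did not see keeps it from learning the base models' overfitting; k-fold out-of-fold predictions are a common alternative to the single split used here.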

RESULTS

Stack-OglyPred-PLM produces Sensitivity, Specificity, Matthews Correlation Coefficient, and Accuracy of 90.50%, 89.60%, 0.464, and 89.70%, respectively on a benchmark NetOGlyc-4.0 independent test dataset. These results demonstrate that Stack-OglyPred-PLM is a robust computational tool to predict O-linked glycosylation sites in proteins.
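The four reported metrics are all functions of the confusion-matrix counts. The sketch below uses the standard definitions; the counts are hypothetical and chosen only for illustration, not taken from the NetOGlyc-4.0 test set. It shows how sensitivity and specificity near 90% can coexist with a much lower Matthews Correlation Coefficient when positive sites are rare.

```python
import math

def site_metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; return 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical imbalanced test set: 100 positive sites, 2000 negative sites.
m = site_metrics(tp=90, fn=10, fp=200, tn=1800)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With few positives, even a 10% false-positive rate produces more false positives than true positives, which lowers precision and therefore MCC while leaving sensitivity, specificity, and accuracy high.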

AVAILABILITY AND IMPLEMENTATION

The developed tool, programs, training, and test dataset are available at https://github.com/PakhrinLab/Stack-OglyPred-PLM.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7da3/11552629/798aee41f907/btae643f1.jpg
