LMNglyPred: prediction of human N-linked glycosylation sites using embeddings from a pre-trained protein language model.

Affiliations

School of Computing, Wichita State University, 1845 Fairmount St., Wichita, KS 67260, USA.

Department of Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, USA.

Publication information

Glycobiology. 2023 Jun 3;33(5):411-422. doi: 10.1093/glycob/cwad033.

DOI:10.1093/glycob/cwad033

PMID:37067908

Abstract

Protein N-linked glycosylation is an important post-translational mechanism in Homo sapiens, playing essential roles in many vital biological processes. It occurs at the N-X-[S/T] sequon in amino acid sequences, where X can be any amino acid except proline. However, not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In this regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem that has not been extensively addressed by existing methods, especially with regard to the creation of negative sets and the use of distilled information from protein language models (pLMs). Here, we developed LMNglyPred, a deep learning-based approach, to predict N-linked glycosylated sites in human proteins using embeddings from a pre-trained pLM. On a benchmark-independent test set, LMNglyPred achieves a sensitivity of 76.50%, a specificity of 75.36%, a precision of 60.99%, an accuracy of 75.74%, and a Matthews correlation coefficient of 0.49. These results demonstrate that LMNglyPred is a robust computational tool to predict N-linked glycosylation sites confined to the N-X-[S/T] sequon.
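
The sequon rule stated in the abstract (N, then any residue except proline, then S or T) can be illustrated with a short Python sketch. This is not LMNglyPred itself and is not taken from the paper's code; it only enumerates candidate asparagine positions, which, as the abstract notes, are necessary but not sufficient for glycosylation. The function name `find_sequons` and the example sequence are hypothetical, chosen for illustration.

```python
import re

# N-X-[S/T] with X != proline. A lookahead is used so that overlapping
# sequons (for example "NNTS") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of the Asn in each N-X-[S/T] sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    example = "MNGTANPSLNNTSK"  # hypothetical sequence, not from the paper
    print(find_sequons(example))  # prints [2, 10, 11]
```

In the example, the asparagine at position 6 is skipped because it is followed by proline. A predictor restricted to sequons would then score only the returned positions, with the unglycosylated ones serving as candidates for a negative set.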
