Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan.
Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, 106, Taiwan.
Comput Biol Med. 2021 Mar;130:104212. doi: 10.1016/j.compbiomed.2021.104212. Epub 2021 Jan 7.
Glycosylation is a dynamic enzymatic process that attaches glycans to proteins or other organic molecules such as lipoproteins. Research has shown that glycosylation of ion channel proteins plays a fundamental role in modulating ion channel function. This study used a computational method to predict sites of N-linked glycosylation, the most common type, in ion channel proteins. For each segment of an ion channel protein centered on an N-linked glycosylation site, the amino acid embedding vectors of its residues were concatenated to form the features used for prediction. We experimented with two models for converting amino acids to their embeddings: one trained on ion channel sequences and the other trained on a large dataset of more than one million protein sequences. The latter model follows the idea of transfer learning and emerged as the more efficient feature extractor. Our best model was obtained by combining this transfer learning approach with hyperparameter tuning by random search on 5-fold cross-validation data. It achieved an accuracy, specificity, sensitivity, and Matthews correlation coefficient of 93.4%, 92.8%, 98.6%, and 0.726, respectively. The corresponding scores on an independent test were 92.9%, 92.2%, 99%, and 0.717. These results outperform those obtained with position-specific scoring matrix features, which are predominantly employed in post-translational modification site prediction. Furthermore, compared with N-GlyDE, GlycoEP, and SPRINT-Gly, the most recent N-linked glycosylation site predictors, our model yields higher scores on all four metrics, further demonstrating the efficiency of our approach.