Indriani Fatma, Mahmudah Kunti Robiatul, Purnama Bedy, Satou Kenji
Graduate School of Natural Science and Technology, Kanazawa University, Kanazawa, Japan.
Department of Computer Science, Lambung Mangkurat University, Banjarmasin, Indonesia.
Front Genet. 2022 May 31;13:885929. doi: 10.3389/fgene.2022.885929. eCollection 2022.
Lysine glutarylation is a post-translational modification (PTM) that plays a regulatory role in various physiological and biological processes. Identifying glutarylated peptides using proteomic techniques is expensive and time-consuming, so computational predictors are useful for rapid identification of glutarylation sites. In this study, we propose a model called ProtTrans-Glutar that classifies a protein sequence as a positive or negative glutarylation site by combining traditional sequence-based features with features derived from a pre-trained transformer-based protein model. The feature vector combines several feature sets: the distribution feature (from composition/transition/distribution encoding), enhanced amino acid composition (EAAC), and features derived from the ProtT5-XL-UniRef50 model. Using random under-sampling and an XGBoost classifier, our model obtained recall, specificity, and AUC scores of 0.7864, 0.6286, and 0.7075, respectively, on an independent test set. The recall and AUC scores were notably higher than those of previous glutarylation prediction models evaluated on the same dataset. This high recall suggests that our method has the potential to identify new glutarylation sites and facilitate further research on the glutarylation process.
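Two of the steps named above, EAAC encoding and random under-sampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `eaac` and `random_undersample`, the sliding-window size of 5, the 20-letter alphabet, the treatment of gap characters, and the example peptide are all assumptions made for illustration.

```python
import random
from collections import Counter

# Assumption: the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def eaac(peptide, window=5):
    """Enhanced amino acid composition (illustrative sketch).

    Slides a fixed-size window along the peptide and records, for each
    window, the fraction of positions occupied by each amino acid.
    Characters outside AMINO_ACIDS (e.g. gap symbols) are simply not
    counted, which is an assumption of this sketch.
    """
    features = []
    for start in range(len(peptide) - window + 1):
        counts = Counter(peptide[start:start + window])
        features.extend(counts[aa] / window for aa in AMINO_ACIDS)
    return features


def random_undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical 6-residue peptide: two windows of size 5, 20 values each.
vector = eaac("AAAAAK")
print(len(vector))  # 40
```

The resulting vectors would then be concatenated with the other feature sets and, after balancing the training classes, passed to a classifier; that part is omitted here because it depends on third-party libraries and on settings the abstract does not specify.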