Wang Houqiang, Li Hong, Gao Weifeng, Xie Jin
School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
Anal Biochem. 2022 Dec 1;658:114935. doi: 10.1016/j.ab.2022.114935. Epub 2022 Oct 4.
Identification of ubiquitination sites is central to many biological experiments. Ubiquitination is a type of post-translational modification (PTM) of proteins. It is a key mechanism for increasing protein diversity and plays a vital role in regulating cell function. In recent years, many models have been developed to predict ubiquitination sites in humans, mice and yeast, but few studies have addressed ubiquitination sites in Arabidopsis thaliana. We therefore propose a deep network model named PrUb-EL to predict ubiquitination sites in Arabidopsis thaliana. First, six sequence-based features are extracted using the amino acid index database (AAindex), dipeptide deviation from expected mean (DDE), dipeptide composition (DPC), the BLOSUM62 block substitution matrix, enhanced amino acid composition (EAAC) and binary encoding. Second, the synthetic minority over-sampling technique (SMOTE) is used to balance the imbalanced data set. Third, a new classifier named DG is presented, which comprises a Dense block, a Residual block and a gated recurrent unit (GRU) block. Finally, each of the six feature extraction methods is paired with the DG model, and an ensemble learning strategy is used to obtain the final prediction. Experimental results show that PrUb-EL has good predictive ability, with an accuracy (ACC) of 91.00% and an area under the ROC curve (auROC) of 97.70% under 5-fold cross-validation. On the independent test, ACC and auROC are 88.58% and 96.09%, respectively. Compared with previous studies, our model shows significantly improved performance, making it an effective method for identifying ubiquitination sites in Arabidopsis thaliana. The datasets and code used for the article are available at https://github.com/Tom-Wangy/PreUb-EL.git.
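As an illustration of two of the six encodings named in the abstract (binary encoding and DPC) and of one possible way to combine per-feature model outputs, a minimal Python sketch follows. It is not taken from the PrUb-EL repository: the function names, the 7-residue example fragment, the handling of non-standard residues, and the use of simple probability averaging as the ensemble rule are all assumptions made for illustration. The abstract does not specify the fragment length or the ensemble strategy.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def binary_encoding(fragment):
    """One-hot encode each residue (20 values per position).

    Assumption: non-standard symbols (e.g. a padding 'X') map to all zeros.
    """
    vector = []
    for residue in fragment:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def dipeptide_composition(fragment):
    """Relative frequency of each of the 400 dipeptides in the fragment.

    Assumption: pairs containing a non-standard symbol are skipped.
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(fragment) - 1):
        pair = fragment[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def soft_vote(probabilities, threshold=0.5):
    """Average per-model probabilities and threshold the mean.

    Assumption: plain averaging stands in for the paper's ensemble strategy.
    """
    mean = sum(probabilities) / len(probabilities)
    return mean, mean >= threshold


if __name__ == "__main__":
    fragment = "MKAKLLK"  # hypothetical fragment centred on a lysine (K)
    print(len(binary_encoding(fragment)))        # 7 positions x 20 = 140
    print(len(dipeptide_composition(fragment)))  # 400
    # Hypothetical probabilities from three per-feature models.
    print(soft_vote([0.9, 0.8, 0.4]))
```

In this sketch each encoding produces a fixed-length vector that a separate classifier could consume; the averaging step only shows where the per-feature predictions would be merged.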