Department of Mathematics, Dalian Maritime University, Dalian, China.
School of Computer Science and Technology, Tianjin University, Tianjin, China.
Bioinformatics. 2018 Jun 15;34(12):2029-2036. doi: 10.1093/bioinformatics/bty039.
Protein O-GlcNAcylation (O-GlcNAc) is an important post-translational modification of serine (S)/threonine (T) residues that is involved in multiple molecular and cellular processes. Recent studies have suggested that abnormal O-GlcNAcylation causes many diseases, such as cancer and various neurodegenerative diseases. Now that experimentally verified protein O-GlcNAcylation sites are available, automated methods that identify O-GlcNAcylation sites rapidly and effectively are highly desirable. Although some computational methods have been proposed, their performance has been unsatisfactory, particularly in terms of prediction sensitivity.
In this study, we developed an ensemble model, O-GlcNAcPRED-II, to identify potential O-GlcNAcylation sites. We first proposed a K-means principal component analysis oversampling technique (KPCA) and a fuzzy undersampling method (FUS) and combined them to reduce the imbalance between the original positive and negative training samples. Then rotation forest, a type of classifier ensemble system, was adopted to divide the eight types of feature space into several subsets, using four sub-classifiers: random forest, k-nearest neighbour, naive Bayes and support vector machine. In 10 runs of five-fold cross-validation, O-GlcNAcPRED-II achieved a sensitivity of 81.05%, a specificity of 95.91%, an accuracy of 91.43% and a Matthews correlation coefficient of 0.7928. The results obtained by O-GlcNAcPRED-II on two independent datasets also indicated that the proposed predictor outperformed five published prediction tools.
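For readers less familiar with the four evaluation measures reported above, the following minimal Python sketch shows how sensitivity, specificity, accuracy and the Matthews correlation coefficient are conventionally computed from a binary confusion matrix. The function name `binary_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper or from the O-GlcNAcPRED-II software.

```python
from math import sqrt


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: positive sites predicted correctly / missed.
    tn/fp: negative sites predicted correctly / wrongly flagged.
    Assumes at least one positive and one negative sample.
    """
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)

    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only (not data from the study).
metrics = binary_metrics(tp=80, fn=20, tn=190, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the Matthews correlation coefficient uses all four cells of the confusion matrix, it is generally considered more informative than accuracy when negative sites far outnumber positive ones, which is the setting the abstract describes.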
http://121.42.167.206/OGlcPred/.
Supplementary data are available at Bioinformatics online.