

O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique.

Affiliations

Department of Mathematics, Dalian Maritime University, Dalian, China.

School of Computer Science and Technology, Tianjin University, Tianjin, China.

Publication information

Bioinformatics. 2018 Jun 15;34(12):2029-2036. doi: 10.1093/bioinformatics/bty039.

Abstract

MOTIVATION

Protein O-GlcNAcylation (O-GlcNAc) is an important post-translational modification of serine (S)/threonine (T) residues that is involved in multiple molecular and cellular processes. Recent studies have suggested that abnormal O-GlcNAcylation causes many diseases, such as cancer and various neurodegenerative diseases. Now that a set of experimentally verified protein O-GlcNAcylation sites is available, automated methods that identify O-GlcNAcylation sites rapidly and effectively are highly desirable. Although some computational methods have been proposed, their performance has been unsatisfactory, particularly in terms of prediction sensitivity.

RESULTS

In this study, we developed an ensemble model, O-GlcNAcPRED-II, to identify potential O-GlcNAcylation sites. A K-means principal component analysis oversampling technique (KPCA) and a fuzzy undersampling method (FUS) were first proposed and combined to reduce the imbalance between the original positive and negative training samples. Then, rotation forest, a type of classifier-integrated system, was adopted to divide the eight types of feature space into several subsets using four sub-classifiers: random forest, k-nearest neighbour, naive Bayes and support vector machine. O-GlcNAcPRED-II achieved a sensitivity of 81.05%, a specificity of 95.91%, an accuracy of 91.43% and a Matthews correlation coefficient of 0.7928 over 10 runs of five-fold cross-validation. Additionally, the results obtained by O-GlcNAcPRED-II on two independent datasets indicated that the proposed predictor outperformed five published prediction tools.
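For readers less familiar with the four reported measures, the following is a minimal sketch of how sensitivity, specificity, accuracy and the Matthews correlation coefficient are conventionally computed from a binary confusion matrix. The function name `binary_metrics` and the counts used in the example are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' own evaluation code may differ.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: positive (modified) sites predicted correctly / missed.
    tn/fp: negative sites predicted correctly / wrongly flagged.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a class is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)          # true positive rate
    specificity = ratio(tn, tn + fp)          # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)     # Matthews correlation coefficient
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for one cross-validation fold (not from the paper).
m = binary_metrics(tp=80, fp=10, tn=190, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because the negative class is larger than the positive class in an imbalanced dataset, accuracy alone can look high even when sensitivity is poor, which is why the Matthews correlation coefficient is commonly reported alongside it.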

AVAILABILITY AND IMPLEMENTATION

http://121.42.167.206/OGlcPred/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

