O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique.

Affiliations

Department of Mathematics, Dalian Maritime University, Dalian, China.

School of Computer Science and Technology, Tianjin University, Tianjin, China.

Publication information

Bioinformatics. 2018 Jun 15;34(12):2029-2036. doi: 10.1093/bioinformatics/bty039.

PMID:29420699

Abstract

MOTIVATION

Protein O-GlcNAcylation (O-GlcNAc) is an important post-translational modification of serine (S)/threonine (T) residues that is involved in multiple molecular and cellular processes. Recent studies have suggested that abnormal O-GlcNAcylation causes many diseases, such as cancer and various neurodegenerative diseases. With experimentally verified protein O-GlcNAcylation sites now available, it is highly desirable to develop automated methods to rapidly and effectively identify O-GlcNAcylation sites. Although some computational methods have been proposed, their performance has been unsatisfactory, particularly in terms of prediction sensitivity.

RESULTS

In this study, we developed an ensemble model, O-GlcNAcPRED-II, to identify potential O-GlcNAcylation sites. A K-means principal component analysis oversampling technique (KPCA) and a fuzzy undersampling method (FUS) were first proposed and incorporated to reduce the imbalance between the original positive and negative training samples. Then, rotation forest, a type of classifier-integrated system, was adopted to divide the eight types of feature space into several subsets using four sub-classifiers: random forest, k-nearest neighbour, naive Bayesian and support vector machine. We observed that O-GlcNAcPRED-II achieved a sensitivity of 81.05%, specificity of 95.91%, accuracy of 91.43% and Matthews correlation coefficient of 0.7928 over 10 runs of five-fold cross-validation. Additionally, the results obtained by O-GlcNAcPRED-II on two independent datasets also indicated that the proposed predictor outperformed five published prediction tools.
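
The four evaluation measures reported in the results (sensitivity, specificity, accuracy and Matthews correlation coefficient) are all functions of a binary confusion matrix. The following is a minimal sketch of how they are conventionally computed, not the authors' code: the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration, and the sketch assumes the data contain at least one positive and one negative sample.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive sites predicted correctly / missed.
    tn/fp: negative sites predicted correctly / wrongly flagged.
    Assumes tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)

    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only; these are not data from the paper.
example = binary_metrics(tp=80, fn=20, tn=190, fp=10)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because accuracy is dominated by the majority class, a predictor can score well on accuracy and specificity while missing many true sites. This is why sensitivity and the Matthews correlation coefficient are the more informative measures on imbalanced site-prediction data.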

AVAILABILITY AND IMPLEMENTATION

http://121.42.167.206/OGlcPred/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

CarSite-II: an integrated classification algorithm for identifying carbonylated sites based on K-means similarity-based undersampling and synthetic minority oversampling techniques.

BMC Bioinformatics. 2021 Apr 26;22(1):216. doi: 10.1186/s12859-021-04134-3.

Characterization and identification of protein O-GlcNAcylation sites with substrate specificity.

BMC Bioinformatics. 2014;15 Suppl 16(Suppl 16):S1. doi: 10.1186/1471-2105-15-S16-S1. Epub 2014 Dec 8.

O-GlcNAcPRED: a sensitive predictor to capture protein O-GlcNAcylation sites.

Mol Biosyst. 2013 Nov;9(11):2909-13. doi: 10.1039/c3mb70326f.

O-GlcNAcPRED-DL: Prediction of Protein O-GlcNAcylation Sites Based on an Ensemble Model of Deep Learning.

J Proteome Res. 2024 Jan 5;23(1):95-106. doi: 10.1021/acs.jproteome.3c00458. Epub 2023 Dec 6.

A two-layered machine learning method to identify protein O-GlcNAcylation sites with O-GlcNAc transferase substrate motifs.

BMC Bioinformatics. 2015;16 Suppl 18(Suppl 18):S10. doi: 10.1186/1471-2105-16-S18-S10. Epub 2015 Dec 9.

dbOGAP - an integrated bioinformatics resource for protein O-GlcNAcylation.

BMC Bioinformatics. 2011 Apr 6;12:91. doi: 10.1186/1471-2105-12-91.

Validation of the reliability of computational O-GlcNAc prediction.

Biochim Biophys Acta. 2014 Feb;1844(2):416-21. doi: 10.1016/j.bbapap.2013.12.002. Epub 2013 Dec 9.

Computational Prediction of Protein O-GlcNAc Modification.

Methods Mol Biol. 2018;1754:235-246. doi: 10.1007/978-1-4939-7717-8_14.

PGlcS: Prediction of protein O-GlcNAcylation sites with multiple features and analysis.

J Theor Biol. 2015 Sep 7;380:524-9. doi: 10.1016/j.jtbi.2015.06.026. Epub 2015 Jun 24.

Cited by

Enhanced O-glycosylation site prediction using explainable machine learning technique with spatial local environment.

Bioinformatics. 2025 Feb 4;41(2). doi: 10.1093/bioinformatics/btaf034.

Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model.

PLoS One. 2024 Dec 31;19(12):e0316215. doi: 10.1371/journal.pone.0316215. eCollection 2024.

O-GlcNAc informatics: advances and trends.

Anal Bioanal Chem. 2025 Feb;417(5):895-905. doi: 10.1007/s00216-024-05531-2. Epub 2024 Sep 18.

Boltzmann Model Predicts Glycan Structures from Lectin Binding.

Anal Chem. 2024 May 28;96(21):8332-8341. doi: 10.1021/acs.analchem.3c04992. Epub 2024 May 8.

Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction.

Int J Mol Sci. 2023 Nov 6;24(21):16000. doi: 10.3390/ijms242116000.

A Boltzmann model predicts glycan structures from lectin binding.

bioRxiv. 2024 Mar 12:2023.06.03.543532. doi: 10.1101/2023.06.03.543532.

A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins.

Front Genet. 2022 Nov 23;13:935717. doi: 10.3389/fgene.2022.935717. eCollection 2022.

An analytical study on the identification of N-linked glycosylation sites using machine learning model.

PeerJ Comput Sci. 2022 Sep 21;8:e1069. doi: 10.7717/peerj-cs.1069. eCollection 2022.

Glycoinformatics in the Artificial Intelligence Era.

Chem Rev. 2022 Oct 26;122(20):15971-15988. doi: 10.1021/acs.chemrev.2c00110. Epub 2022 Aug 12.

Identification of Type 2 Diabetes Biomarkers From Mixed Single-Cell Sequencing Data With Feature Selection Methods.

Front Bioeng Biotechnol. 2022 Jun 2;10:890901. doi: 10.3389/fbioe.2022.890901. eCollection 2022.
