Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, 3800, Australia.
Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, 3800, Australia.
BMC Bioinformatics. 2019 Mar 6;20(1):112. doi: 10.1186/s12859-019-2700-1.
As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and function. Its abundance and ubiquity across all three domains of life (Eukarya, Bacteria and Archaea) reflect its roles in regulating a variety of signalling and metabolic pathways. Mutations at or near glycosylation sites are strongly associated with human diseases. Accordingly, accurate prediction of glycosylation sites can complement laboratory-based methods and greatly benefit experimental efforts to characterize and understand the functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, with promising predictive performance. Training a conventional supervised-learning model requires both reliable positive and reliable negative samples. In practice, however, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled owing to the limitations of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites).
In this study, we propose a positive-unlabelled (PU) learning method, PA2DE (V2.0), based on the AlphaMax algorithm, for protein glycosylation site prediction. The predictive performance of the proposed method was evaluated on glycosylation data sets collected over a ten-year period at three-year intervals. Experiments using both benchmarking and independent tests show that our method outperformed representative supervised-learning algorithms (including support vector machines and random forests), one-class learners and currently available prediction methods in terms of F1 score, accuracy and AUC. In addition, we developed an online web server implementing the optimized model (available at http://glycomine.erc.monash.edu/Lab/GlycoMine_PU/ ) to facilitate community-wide efforts for accurate prediction of protein glycosylation sites.
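To make the PU setting concrete, the sketch below shows one simple way a classifier trained on "labelled versus unlabelled" data can be turned into an estimate of "glycosylated versus not". It is not the PA2DE or AlphaMax procedure of this study: it uses the well-known label-frequency correction of Elkan and Noto, which assumes that labelled positives are a random sample of all positives. All function names and numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def estimate_label_frequency(scores_on_labelled_positives):
    """Estimate c = P(labelled | positive).

    Under the 'selected completely at random' assumption, c can be
    approximated by the mean score that a labelled-vs-unlabelled
    classifier assigns to held-out labelled positives.
    """
    if not scores_on_labelled_positives:
        raise ValueError("need at least one held-out labelled positive")
    return mean(scores_on_labelled_positives)


def correct_pu_scores(scores, c):
    """Convert g(x) = P(labelled | x) into P(positive | x) = g(x) / c.

    Results are clipped to 1.0 because an estimated c can be too small.
    """
    if not 0.0 < c <= 1.0:
        raise ValueError("label frequency c must lie in (0, 1]")
    return [min(1.0, s / c) for s in scores]


# Hypothetical scores from a classifier trained to separate labelled
# (experimentally verified) sites from unlabelled candidate sites.
held_out_positive_scores = [0.5, 0.25, 0.75]
unlabelled_site_scores = [0.1, 0.25, 0.5, 0.75]

c = estimate_label_frequency(held_out_positive_scores)
posteriors = correct_pu_scores(unlabelled_site_scores, c)
print(c, posteriors)
```

The point of the correction is that an unlabelled site with a moderate raw score may still be a likely glycosylation site once the fraction of positives that were ever labelled is taken into account.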
The proposed PU learning approach achieved competitive predictive performance compared with currently available methods. This PU learning scheme may also be applied to the prediction of other important types of protein PTM sites and functional sites.