PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins.

Author information

Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.

Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia.

Publication information

Bioinformatics. 2020 Feb 1;36(3):704-712. doi: 10.1093/bioinformatics/btz629.

DOI:10.1093/bioinformatics/btz629

PMID:31393553

Abstract

MOTIVATION

Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data.

RESULTS

In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as the machine learning approach, and the necessary parameter optimization was performed using a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. The final ensemble model achieved an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors.
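For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how they are commonly computed for a binary classifier. It is not the authors' evaluation code. It assumes that "F-value" denotes the F1 score, that labels are 1 (non-classical secreted) and 0 (not secreted), and that a score threshold of 0.5 is used; the function names and toy data are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_value(tp, tn, fp, fn):
    """F1 score (assumed meaning of 'F-value'): 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive outscores a random
    negative, with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): four positives followed by four negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("F-value: ", f_value(*counts))
print("MCC:     ", mcc(*counts))
print("AUC:     ", auc(y_true, scores))
```

Accuracy, F-value and MCC depend on the chosen threshold, whereas AUC is computed from the raw scores and is threshold-independent, which is one reason abstracts such as this one report both kinds of metric.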

AVAILABILITY AND IMPLEMENTATION

http://pengaroo.erc.monash.edu/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
