Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria, Australia.
Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbac031.
Protein secretion plays a pivotal role in many biological processes and is particularly important for intercellular communication, as it moves proteins from the cytoplasm to the host or the external environment. Gram-positive bacteria can secrete proteins through multiple secretion pathways. Among these, the non-classical secretion pathway has recently received increasing attention, but its exact mechanism remains unclear. Non-classical secreted proteins (NCSPs) are a class of secreted proteins that lack signal peptides and motifs. Several predictors have been proposed to identify NCSPs, and most of them use the whole amino acid sequence to construct the model. However, sequence length varies greatly between proteins. In addition, not all regions of a protein are equally important, and some local regions are not relevant to secretion. The functional regions of the protein, particularly the N- and C-terminal regions, contain important determinants for secretion. In this study, we propose a new hybrid deep learning-based framework, referred to as ASPIRER, which improves the prediction of NCSPs from amino acid sequences. More specifically, it combines a whole sequence-based XGBoost model and an N-terminal sequence-based convolutional neural network model. Five-fold cross-validation and independent tests demonstrate that ASPIRER outperforms existing state-of-the-art approaches. The source code and curated datasets of ASPIRER are publicly available at https://github.com/yanwu20/ASPIRER/. ASPIRER is anticipated to be a useful tool for improved prediction of novel putative NCSPs from sequence information and for prioritization of candidate proteins for follow-up experimental validation.
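The two-branch idea described in the abstract can be illustrated with a minimal Python sketch. This is not the ASPIRER implementation: the amino acid composition features, the 50-residue N-terminal window, the one-hot encoding and the weighted-average combination rule are all assumptions chosen for illustration, and the function names are hypothetical. The sketch only shows how a variable-length sequence might be turned into one fixed-size input per branch, and how two branch probabilities could be merged into a single score.

```python
from typing import List

# The 20 standard amino acids, in a fixed order used for all encodings.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(seq: str) -> List[float]:
    """Whole-sequence branch input (assumed): fraction of each residue type.

    The vector always has length 20, regardless of sequence length.
    """
    seq = seq.upper()
    n = max(len(seq), 1)  # avoid division by zero for empty input
    return [seq.count(aa) / n for aa in AMINO_ACIDS]


def n_terminal_one_hot(seq: str, window: int = 50) -> List[List[int]]:
    """N-terminal branch input (assumed): one-hot matrix of the first
    `window` residues, zero-padded when the sequence is shorter.

    The window size of 50 is an illustrative assumption.
    """
    prefix = seq.upper()[:window]
    matrix = []
    for i in range(window):
        row = [0] * len(AMINO_ACIDS)
        if i < len(prefix):
            idx = AMINO_ACIDS.find(prefix[i])
            if idx >= 0:  # non-standard residues stay all-zero
                row[idx] = 1
        matrix.append(row)
    return matrix


def combine_scores(p_whole: float, p_nterm: float, weight: float = 0.5) -> float:
    """Merge the two branch probabilities with a weighted average.

    The averaging rule and the default weight are assumptions.
    """
    return weight * p_whole + (1.0 - weight) * p_nterm
```

In such a setup, the 20-dimensional composition vector would feed a gradient-boosted tree model, the window-by-20 matrix would feed a convolutional network, and `combine_scores` would turn the two predicted probabilities into the final NCSP score.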