NCSP-PLM: An ensemble learning framework for predicting non-classical secreted proteins based on protein language models and deep learning.

Affiliation

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.

Publication information

Math Biosci Eng. 2024 Jan;21(1):1472-1488. doi: 10.3934/mbe.2024063. Epub 2022 Dec 28.

Abstract

Non-classical secreted proteins (NCSPs) are proteins that are located in the extracellular environment despite lacking signal peptides and motifs. They play diverse roles in intercellular communication, so accurate prediction of NCSPs is a critical step toward an in-depth understanding of their secretion mechanisms. Because experimental identification of NCSPs is often costly and time-consuming, computational methods are needed. In this study, we propose an ensemble learning framework, termed NCSP-PLM, that identifies NCSPs by extracting feature embeddings from pre-trained protein language models (PLMs) and using them as input to several fine-tuned deep learning models. First, we compared the performance of nine PLM embeddings by training three neural networks on each: a multi-layer perceptron (MLP), an attention mechanism, and a bidirectional long short-term memory network (BiLSTM), and selected the best network model for each PLM embedding. Then, four of the resulting nine models were excluded because of their below-average accuracies, and the remaining five were integrated to predict NCSPs by weighted voting. Finally, 5-fold cross-validation and an independent test were conducted to evaluate the performance of NCSP-PLM on the benchmark datasets. On the same independent dataset, the sensitivity and specificity of NCSP-PLM were 91.18% and 97.06%, respectively. In particular, the overall accuracy of our model reached 94.12%, which was 7-16% higher than that of existing state-of-the-art predictors. These results indicate that NCSP-PLM could serve as a useful tool for the annotation of NCSPs.
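
The selection and voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the model names, validation accuracies and probabilities are hypothetical, the use of validation accuracy as the voting weight is an assumption (the abstract does not say how the weights were chosen), and the 34 positive / 34 negative test split is an assumed size picked only because it is consistent with the reported percentages.

```python
from statistics import mean


def select_models(val_acc):
    """Drop models whose validation accuracy is below the average."""
    avg = mean(val_acc.values())
    return {name: acc for name, acc in val_acc.items() if acc >= avg}


def weighted_vote(probs, weights, threshold=0.5):
    """Weighted average of per-model positive-class probabilities.

    Assumption: weights are validation accuracies; the abstract does not
    specify the weighting scheme.
    """
    total = sum(weights[name] for name in probs)
    score = sum(weights[name] * p for name, p in probs.items()) / total
    return score, int(score >= threshold)


def sens_spec_acc(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = NCSP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(pairs)


# Hypothetical validation accuracies for nine PLM-based models.
val_acc = {
    "plm_a": 0.90, "plm_b": 0.89, "plm_c": 0.88,
    "plm_d": 0.87, "plm_e": 0.86, "plm_f": 0.80,
    "plm_g": 0.79, "plm_h": 0.78, "plm_i": 0.77,
}
kept = select_models(val_acc)  # the five above-average models remain

# Hypothetical positive-class probabilities for one query protein.
probs = {name: 0.6 for name in kept}
score, label = weighted_vote(probs, kept)
print(sorted(kept), round(score, 3), label)

# Assumed test set: 34 NCSPs (3 missed) and 34 non-NCSPs (1 false alarm).
y_true = [1] * 34 + [0] * 34
y_pred = [1] * 31 + [0] * 3 + [0] * 33 + [1] * 1
sens, spec, acc = sens_spec_acc(y_true, y_pred)
print(f"{100 * sens:.2f} {100 * spec:.2f} {100 * acc:.2f}")
```

With these assumed counts (31 of 34 positives and 33 of 34 negatives classified correctly), the three metrics come out to 91.18%, 97.06% and 94.12%, matching the figures reported above.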
