Porter 6: Protein Secondary Structure Prediction by Leveraging Pre-Trained Language Models (PLMs).

Authors

Alanazi Wafa, Meng Di, Pollastri Gianluca

Affiliations

School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland.

Department of Computer Science, College of Science, Northern Border University, Arar P.O. Box 2014, Saudi Arabia.

Publication information

Int J Mol Sci. 2024 Dec 27;26(1):130. doi: 10.3390/ijms26010130.

DOI:10.3390/ijms26010130

PMID:39795988

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11719765/

Abstract

Accurate protein secondary structure prediction (PSSP) is crucial for understanding protein function, which is foundational to advances in drug development, disease treatment, and biotechnology. By predicting protein secondary structures, researchers gain critical insights into protein folding and function within cells. The advent of deep learning models, capable of processing complex sequence data and identifying meaningful patterns, offers substantial potential to enhance the accuracy and efficiency of protein structure predictions. In particular, recent breakthroughs in deep learning, driven by the integration of natural language processing (NLP) algorithms, have significantly advanced the field of protein research. Inspired by the success of NLP techniques, this study uses pre-trained language models (PLMs) to advance PSSP. We conduct a comprehensive evaluation of various deep learning models trained on distinct sequence embeddings, including one-hot encoding and PLM-based approaches such as ProtTrans and ESM-2, to develop a prediction system optimized for accuracy and computational efficiency. Our proposed model, Porter 6, is an ensemble of CBRNN-based predictors that takes embeddings from the protein language model ESM-2 as input features. Porter 6 achieves strong performance on large-scale, independent test sets. On a 2022 test set, the model attains 86.60% accuracy in three-state (Q3) and 76.43% in eight-state (Q8) classification. On a more recent 2024 test set, Porter 6 maintains robust performance, achieving 84.56% in Q3 and 74.18% in Q8 classification. This represents a significant 3% improvement over its predecessor, and Porter 6 outperforms or matches state-of-the-art approaches in the field.
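
The Q3 and Q8 figures quoted in the abstract are per-residue accuracies: the percentage of residues whose predicted state matches the observed one. The sketch below illustrates how such scores can be computed. It is not the authors' evaluation code. The eight-letter state alphabet and the 8-to-3 reduction (H, G, I to helix; E, B to strand; everything else to coil) are a commonly used convention assumed here, and the function names are hypothetical. Deriving three-state labels by reducing eight-state predictions is done only for illustration; the abstract does not say how Porter 6 produces its Q3 output.

```python
# Assumption: states are written with the 8 letters H, G, I, E, B, T, S, C.
# Assumption: this 8-to-3 reduction is a common convention, not taken from the abstract.
DSSP8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",  # helical states -> helix
    "E": "E", "B": "E",            # strand and bridge -> strand
    "T": "C", "S": "C", "C": "C",  # turn, bend, other -> coil
}


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of positions where the two label strings agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def reduce_to_q3(states8: str) -> str:
    """Collapse an eight-state label string to three states (H, E, C)."""
    return "".join(DSSP8_TO_Q3[s] for s in states8)


def q8_and_q3(predicted8: str, observed8: str) -> tuple[float, float]:
    """Score eight-state labels directly, then again after reducing both to three states."""
    q8 = per_residue_accuracy(predicted8, observed8)
    q3 = per_residue_accuracy(reduce_to_q3(predicted8), reduce_to_q3(observed8))
    return q8, q3


if __name__ == "__main__":
    # Toy labels for a 10-residue fragment (made up for illustration only).
    q8, q3 = q8_and_q3("HHHGEEBTSC", "HHHHEEETCC")
    print(f"Q8 = {q8:.2f}%  Q3 = {q3:.2f}%")
```

Because several eight-state classes collapse into one three-state class, a Q8 error such as G predicted for H no longer counts as an error after reduction, which is one reason Q3 scores are typically higher than Q8 scores on the same test set.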

Similar articles

PaleAle 6.0: Prediction of Protein Relative Solvent Accessibility by Leveraging Pre-Trained Language Models (PLMs).

Biomolecules. 2025 Jan 2;15(1):49. doi: 10.3390/biom15010049.

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.

IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7112-7127. doi: 10.1109/TPAMI.2021.3095381. Epub 2022 Sep 14.

MCNN-AAPT: accurate classification and functional prediction of amino acid and peptide transporters in secondary active transporters using protein language models and multi-window deep learning.

J Biomol Struct Dyn. 2024 Nov 22:1-10. doi: 10.1080/07391102.2024.2431664.

MHTAPred-SS: A Highly Targeted Autoencoder-Driven Deep Multi-Task Learning Framework for Accurate Protein Secondary Structure Prediction.

Int J Mol Sci. 2024 Dec 15;25(24):13444. doi: 10.3390/ijms252413444.

ILMCNet: A Deep Neural Network Model That Uses PLM to Process Features and Employs CRF to Predict Protein Secondary Structure.

Genes (Basel). 2024 Oct 21;15(10):1350. doi: 10.3390/genes15101350.

NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning.

Nucleic Acids Res. 2022 Jul 5;50(W1):W510-W515. doi: 10.1093/nar/gkac439.

Deeper Profiles and Cascaded Recurrent and Convolutional Neural Networks for state-of-the-art Protein Secondary Structure Prediction.

Sci Rep. 2019 Aug 26;9(1):12374. doi: 10.1038/s41598-019-48786-x.

Modeling aspects of the language of life through transfer-learning protein sequences.

BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.

LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.

Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae290.

Cited by

DeepPredict: a state-of-the-art web server for protein secondary structure and relative solvent accessibility prediction.

Front Bioinform. 2025 Jun 6;5:1607402. doi: 10.3389/fbinf.2025.1607402. eCollection 2025.

Comprehensive assessment of AlphaFold's predictions of secondary structure and solvent accessibility at the amino acid-level in eukaryotic, bacterial and archaeal proteins.

Comput Struct Biotechnol J. 2025 May 29;27:2443-2449. doi: 10.1016/j.csbj.2025.05.047. eCollection 2025.

Advancements in one-dimensional protein structure prediction using machine learning and deep learning.

Comput Struct Biotechnol J. 2025 Apr 3;27:1416-1430. doi: 10.1016/j.csbj.2025.04.005. eCollection 2025.
