

Porter 6: Protein Secondary Structure Prediction by Leveraging Pre-Trained Language Models (PLMs).

Authors

Alanazi Wafa, Meng Di, Pollastri Gianluca

Affiliations

School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland.

Department of Computer Science, College of Science, Northern Border University, Arar P.O. Box 2014, Saudi Arabia.

Publication information

Int J Mol Sci. 2024 Dec 27;26(1):130. doi: 10.3390/ijms26010130.

Abstract

Accurate protein secondary structure prediction (PSSP) is crucial for understanding protein function, which underpins advances in drug development, disease treatment, and biotechnology. Predicted secondary structures give researchers critical insight into how proteins fold and function within cells. Deep learning models, which can process complex sequence data and identify meaningful patterns, offer substantial potential to improve the accuracy and efficiency of protein structure prediction. In particular, recent breakthroughs in deep learning, driven by the integration of natural language processing (NLP) algorithms, have significantly advanced the field of protein research. Inspired by the success of NLP techniques, this study uses pre-trained language models (PLMs) to advance PSSP. We conduct a comprehensive evaluation of deep learning models trained on distinct sequence representations, including one-hot encoding and PLM-based embeddings such as ProtTrans and ESM-2, to develop a prediction system optimized for accuracy and computational efficiency. Our proposed model, Porter 6, is an ensemble of CBRNN-based predictors that takes embeddings from the protein language model ESM-2 as input features. Porter 6 performs strongly on large-scale, independent test sets. On a 2022 test set, the model attains 86.60% accuracy in three-state (Q3) and 76.43% in eight-state (Q8) classification. On a more recent 2024 test set, Porter 6 maintains robust performance, achieving 84.56% in Q3 and 74.18% in Q8. This represents a 3% improvement over its predecessor, and Porter 6 matches or outperforms state-of-the-art approaches in the field.
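
To make the Q3 and Q8 figures concrete, the following minimal sketch shows how per-residue accuracy is typically computed from predicted and reference secondary structure strings. It is an illustration, not the authors' evaluation code. The 8-to-3 state reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is a common convention and an assumption, since the abstract does not state which mapping Porter 6 uses. The function names and the example strings are hypothetical.

```python
# Assumed DSSP 8-state -> 3-state reduction (a common convention,
# not confirmed by the abstract): helix-like -> H, strand-like -> E, rest -> C.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def reduce_to_q3(ss8: str) -> str:
    """Map an 8-state secondary structure string to 3 states."""
    return "".join(DSSP8_TO_3[state] for state in ss8)


def accuracy(pred: str, true: str) -> float:
    """Percentage of residues whose predicted state equals the reference."""
    if len(pred) != len(true) or not true:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(pred, true))
    return 100.0 * correct / len(true)


# Hypothetical 15-residue example (made-up labels, not data from the paper).
true_ss8 = "CHHHHGGTEEEEBSC"
pred_ss8 = "CHHHHHHTEEEECSC"

q8 = accuracy(pred_ss8, true_ss8)
q3 = accuracy(reduce_to_q3(pred_ss8), reduce_to_q3(true_ss8))
print(f"Q8 = {q8:.2f}%, Q3 = {q3:.2f}%")
```

In this toy example the G-versus-H confusions count as errors in Q8 but disappear after reduction to three states, which is why Q3 scores are generally higher than Q8 scores for the same predictor.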


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b5c3/11719765/bc04abc4f416/ijms-26-00130-g001.jpg
