


An analysis of protein language model embeddings for fold prediction.

Affiliations

Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain.

Publication Information

Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac142.

DOI: 10.1093/bib/bbac142
PMID: 35443054
Abstract

The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings mainly using evolutionary information in the form of multiple sequence alignment (MSA) as input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are supervisedly trained with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
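The framework the abstract describes — per-residue embeddings from a protein language model, pooled to a fixed-size protein-level vector and then compared or classified — can be sketched in a few lines. The sketch below uses random arrays as stand-ins for real protein-LM embeddings (which in the paper come from models such as ESM-1b or ProtT5), and ranks hypothetical fold templates by cosine similarity for the pairwise fold recognition (PFR) setting; all names (`mean_pool`, `fold_a`, `fold_b`) are illustrative, not from the paper.

```python
import numpy as np

def mean_pool(per_residue_emb):
    """Collapse an (L, D) per-residue embedding matrix to a protein-level D-vector."""
    return per_residue_emb.mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two protein-level embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for protein-LM outputs: (sequence_length, embed_dim) arrays.
rng = np.random.default_rng(0)
query = mean_pool(rng.normal(size=(120, 16)))
templates = {name: mean_pool(rng.normal(size=(length, 16)))
             for name, length in [("fold_a", 90), ("fold_b", 150)]}

# Pairwise fold recognition: rank template proteins by embedding similarity to the query.
ranked = sorted(templates, key=lambda n: cosine(query, templates[n]), reverse=True)
```

In the paper this similarity-ranking step is replaced by trained networks (MLP, RBG, LAT) fine-tuned on fold labels, and direct fold classification (DFC) instead attaches a softmax classifier to the pooled embedding; the pooling and fixed-size representation shown here are the common starting point.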


Similar Articles

1. An analysis of protein language model embeddings for fold prediction.
Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac142.
2. FoldHSphere: deep hyperspherical embeddings for protein fold recognition.
BMC Bioinformatics. 2021 Oct 12;22(1):490. doi: 10.1186/s12859-021-04419-7.
3. Modeling aspects of the language of life through transfer-learning protein sequences.
BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.
4. Assessing the role of evolutionary information for enhancing protein language model embeddings.
Sci Rep. 2024 Sep 5;14(1):20692. doi: 10.1038/s41598-024-71783-8.
5. NCSP-PLM: An ensemble learning framework for predicting non-classical secreted proteins based on protein language models and deep learning.
Math Biosci Eng. 2024 Jan;21(1):1472-1488. doi: 10.3934/mbe.2024063. Epub 2022 Dec 28.
6. Fine-tuning protein language models boosts predictions across diverse tasks.
Nat Commun. 2024 Aug 28;15(1):7407. doi: 10.1038/s41467-024-51844-2.
7. Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction.
Int J Mol Sci. 2023 Nov 6;24(21):16000. doi: 10.3390/ijms242116000.
8. LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae290.
9. Improved biomedical word embeddings in the transformer era.
J Biomed Inform. 2021 Aug;120:103867. doi: 10.1016/j.jbi.2021.103867. Epub 2021 Jul 18.
10. FuseLinker: Leveraging LLM's pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs.
J Biomed Inform. 2024 Oct;158:104730. doi: 10.1016/j.jbi.2024.104730. Epub 2024 Sep 24.

Cited By

1. Exo-Tox: Identifying Exotoxins from secreted bacterial proteins.
BioData Min. 2025 Aug 8;18(1):52. doi: 10.1186/s13040-025-00469-2.
2. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.
Database (Oxford). 2025 May 30;2025. doi: 10.1093/database/baaf027.
3. GRU4ACE: Enhancing ACE inhibitory peptide prediction by integrating gated recurrent unit with multi-source feature embeddings.
Protein Sci. 2025 Jun;34(6):e70026. doi: 10.1002/pro.70026.
4. PLM-ATG: Identification of Autophagy Proteins by Integrating Protein Language Model Embeddings with PSSM-Based Features.
Molecules. 2025 Apr 10;30(8):1704. doi: 10.3390/molecules30081704.
5. Aggregating residue-level protein language model embeddings with optimal transport.
Bioinform Adv. 2025 Mar 20;5(1):vbaf060. doi: 10.1093/bioadv/vbaf060. eCollection 2025.
6. Enhancing Functional Protein Design Using Heuristic Optimization and Deep Learning for Anti-Inflammatory and Gene Therapy Applications.
Proteins. 2025 Jul;93(7):1238-1256. doi: 10.1002/prot.26810. Epub 2025 Feb 22.
7. Machine learning approaches for predicting protein-ligand binding sites from sequence data.
Front Bioinform. 2025 Feb 3;5:1520382. doi: 10.3389/fbinf.2025.1520382. eCollection 2025.
8. TargetCLP: clathrin proteins prediction combining transformed and evolutionary scale modeling-based multi-view features via weighted feature integration approach.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf026.
9. Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model.
PLoS One. 2024 Dec 31;19(12):e0316215. doi: 10.1371/journal.pone.0316215. eCollection 2024.
10. Benchmarking recent computational tools for DNA-binding protein identification.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae634.