
Fine-tuning protein language models boosts predictions across diverse tasks.

Affiliations

TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Garching/Munich, Germany.

TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany.

Publication Information

Nat Commun. 2024 Aug 28;15(1):7407. doi: 10.1038/s41467-024-51844-2.

DOI: 10.1038/s41467-024-51844-2
PMID: 39198457
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11358375/
Abstract

Prediction methods inputting embeddings from protein language models have reached or even surpassed state-of-the-art performance on many protein prediction tasks. In natural language processing fine-tuning large language models has become the de facto standard. In contrast, most protein language model-based protein predictions do not back-propagate to the language model. Here, we compare the fine-tuning of three state-of-the-art models (ESM2, ProtT5, Ankh) on eight different tasks. Two results stand out. Firstly, task-specific supervised fine-tuning almost always improves downstream predictions. Secondly, parameter-efficient fine-tuning can reach similar improvements consuming substantially fewer resources at up to 4.5-fold acceleration of training over fine-tuning full models. Our results suggest to always try fine-tuning, in particular for problems with small datasets, such as for fitness landscape predictions of a single protein. For ease of adaptability, we provide easy-to-use notebooks to fine-tune all models used during this work for per-protein (pooling) and per-residue prediction tasks.
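
The authors provide ready-to-use fine-tuning notebooks with the paper; the snippet below is only a minimal sketch of the idea the abstract describes: attach a LoRA adapter (parameter-efficient fine-tuning) to a small ESM2 checkpoint and back-propagate a per-protein (pooled) classification loss into the language model, here via the Hugging Face transformers and peft libraries. The checkpoint name, LoRA settings, and toy sequences/labels are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch of parameter-efficient, task-specific fine-tuning of a
# protein language model (NOT the authors' released notebooks).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Small ESM2 variant chosen for illustration; the paper also evaluates ProtT5 and Ankh.
checkpoint = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# LoRA: train low-rank updates on the attention projections instead of all weights;
# task_type="SEQ_CLS" also keeps the new classification head trainable.
lora_cfg = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["query", "value"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically a small fraction of the full model

# Toy per-protein task: two short sequences with made-up labels.
seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI"]
labels = torch.tensor([0, 1])
batch = tokenizer(seqs, return_tensors="pt", padding=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
out = model(**batch, labels=labels)   # pooled, per-protein prediction
out.loss.backward()                   # gradients flow into the LoRA adapters
optimizer.step()
```

A per-residue task would follow the same pattern with a token-classification head instead of the pooled sequence-classification head.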

Figures (PMC11358375):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d894/11358375/18a214cc2425/41467_2024_51844_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d894/11358375/1af5d7fc109e/41467_2024_51844_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d894/11358375/761e322b86bf/41467_2024_51844_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d894/11358375/730f79988c6a/41467_2024_51844_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d894/11358375/af1470589e07/41467_2024_51844_Fig5_HTML.jpg

Similar Articles

1. Fine-tuning protein language models boosts predictions across diverse tasks.
Nat Commun. 2024 Aug 28;15(1):7407. doi: 10.1038/s41467-024-51844-2.
2. Democratizing protein language models with parameter-efficient fine-tuning.
Proc Natl Acad Sci U S A. 2024 Jun 25;121(26):e2405840121. doi: 10.1073/pnas.2405840121. Epub 2024 Jun 20.
3. An analysis of protein language model embeddings for fold prediction.
Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac142.
4. Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning.
bioRxiv. 2023 Nov 10:2023.11.09.566187. doi: 10.1101/2023.11.09.566187.
5. LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae290.
6. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.
IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7112-7127. doi: 10.1109/TPAMI.2021.3095381. Epub 2022 Sep 14.
7. Simple, Efficient, and Scalable Structure-Aware Adapter Boosts Protein Language Models.
J Chem Inf Model. 2024 Aug 26;64(16):6338-6349. doi: 10.1021/acs.jcim.4c00689. Epub 2024 Aug 7.
8. Parameter-efficient fine-tuning on large protein language models improves signal peptide prediction.
Genome Res. 2024 Oct 11;34(9):1445-1454. doi: 10.1101/gr.279132.124.
9. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1865-1874. doi: 10.1093/jamia/ocae037.
10. Modeling aspects of the language of life through transfer-learning protein sequences.
BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.

Cited By

1. T-cell receptor specificity landscape revealed through de novo peptide design.
ArXiv. 2025 Sep 4:arXiv:2503.00648v2.
2. Highly accurate prophage island detection with PIDE.
Genome Biol. 2025 Aug 20;26(1):254. doi: 10.1186/s13059-025-03733-0.
3. NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model.
4. TSignal: a transformer model for signal peptide prediction.
BMC Bioinformatics. 2025 Aug 19;26(1):216. doi: 10.1186/s12859-025-06220-2.
5. mamp-ml: A deep learning approach to epitope immunogenicity in plants.
bioRxiv. 2025 Jul 15:2025.07.11.664399. doi: 10.1101/2025.07.11.664399.
6. PLMFit: benchmarking transfer learning with protein language models for protein engineering.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf381.
7. Locality-aware pooling enhances protein language model performance across varied applications.
Bioinformatics. 2025 Jul 1;41(Supplement_1):i217-i226. doi: 10.1093/bioinformatics/btaf178.
8. StackGlyEmbed: prediction of N-linked glycosylation sites using protein language models.
Bioinform Adv. 2025 Jun 28;5(1):vbaf146. doi: 10.1093/bioadv/vbaf146. eCollection 2025.
9. Large Context, Deeper Insights: Harnessing Large Language Models for Advancing Protein-Protein Interaction Analysis.
Methods Mol Biol. 2025;2941:243-267. doi: 10.1007/978-1-0716-4623-6_15.
10. Gaia: An AI-enabled genomic context-aware platform for protein sequence annotation.
Sci Adv. 2025 Jun 20;11(25):eadv5109. doi: 10.1126/sciadv.adv5109.
11. Fine-tuning protein language models to understand the functional impact of missense variants.
Comput Struct Biotechnol J. 2025 May 28;27:2199-2207. doi: 10.1016/j.csbj.2025.05.022. eCollection 2025.

References

1. Democratizing protein language models with parameter-efficient fine-tuning.
Proc Natl Acad Sci U S A. 2024 Jun 25;121(26):e2405840121. doi: 10.1073/pnas.2405840121. Epub 2024 Jun 20.
2. ProGen2: Exploring the boundaries of protein language models.
Cell Syst. 2023 Nov 15;14(11):968-978.e3. doi: 10.1016/j.cels.2023.10.002. Epub 2023 Oct 30.
3. Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i347-i356. doi: 10.1093/bioinformatics/btad228.
4. Fast and accurate protein structure search with Foldseek.
Nat Biotechnol. 2024 Feb;42(2):243-246. doi: 10.1038/s41587-023-01773-0. Epub 2023 May 8.
5. Evolutionary-scale prediction of atomic-level protein structure with a language model.
Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16.
6. Large language models generate functional protein sequences across diverse families.
Nat Biotechnol. 2023 Aug;41(8):1099-1106. doi: 10.1038/s41587-022-01618-2. Epub 2023 Jan 26.
7. Light attention predicts protein location from the language of life.
Bioinform Adv. 2021 Nov 19;1(1):vbab035. doi: 10.1093/bioadv/vbab035. eCollection 2021.
8. AbLang: an antibody language model for completing antibody sequences.
Bioinform Adv. 2022 Jun 17;2(1):vbac046. doi: 10.1093/bioadv/vbac046. eCollection 2022.
9. BepiPred-3.0: Improved B-cell epitope prediction using protein language models.
Protein Sci. 2022 Dec;31(12):e4497. doi: 10.1002/pro.4497.
10. SETH predicts nuances of residue disorder from protein embeddings.
Front Bioinform. 2022 Oct 10;2:1019597. doi: 10.3389/fbinf.2022.1019597. eCollection 2022.