Assessing the role of evolutionary information for enhancing protein language model embeddings.

Affiliations

TUM School of Computation, Information and Technology, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany.

TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.

Publication information

Sci Rep. 2024 Sep 5;14(1):20692. doi: 10.1038/s41598-024-71783-8.

DOI: 10.1038/s41598-024-71783-8
PMID: 39237735
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11377704/
Abstract

Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested various approaches to explicitly incorporate evolutionary information into embeddings on various protein prediction tasks. While older pLMs (SeqVec, ProtBert) significantly improved through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based outperformed MSA-based methods, and the combination of both even decreased performance for some (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.
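
The abstract does not say which approaches were used to combine evolutionary information with embeddings. As a minimal illustration only, and not the authors' method, the sketch below shows one simple possibility: derive a per-residue amino-acid frequency profile from an MSA and concatenate it with per-residue embedding vectors. The function names `msa_profile` and `concat_features`, the toy alignment, and the two-dimensional placeholder embedding are all hypothetical; a real pLM such as ProtT5 would supply much larger vectors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def msa_profile(msa):
    """Return per-residue amino-acid frequencies for the query (first row).

    Columns where the query has a gap are skipped, so the profile has one
    20-dimensional row per query residue. Hypothetical helper for illustration.
    """
    query = msa[0]
    profile = []
    for col, query_residue in enumerate(query):
        if query_residue == "-":
            continue  # insertion relative to the query: no residue to annotate
        # Count only standard amino acids; gaps and unknown symbols are ignored.
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append([counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS])
    return profile


def concat_features(embedding, profile):
    """Append the 20 profile frequencies to each per-residue embedding vector."""
    if len(embedding) != len(profile):
        raise ValueError("embedding and profile must cover the same residues")
    return [list(vec) + freqs for vec, freqs in zip(embedding, profile)]


# Toy data (assumed): a 3-residue query with three aligned homologs,
# and a placeholder 2-dimensional embedding per residue.
msa = ["MKV", "MRV", "MKI", "-KV"]
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

features = concat_features(embedding, msa_profile(msa))
print(len(features), len(features[0]))
```

Concatenation is only one option. The combined vectors would then be fed to a downstream predictor, and whether this helps is exactly the question the study evaluates.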


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e1d/11377704/9fd7a1dd9e93/41598_2024_71783_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e1d/11377704/cfa4aa6ad27d/41598_2024_71783_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e1d/11377704/2173ef790e8a/41598_2024_71783_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e1d/11377704/b7893db4528e/41598_2024_71783_Fig4_HTML.jpg
