Structure-informed protein language models are robust predictors for variant effects.

Authors

Sun Yuanfei, Shen Yang

Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA.

Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA.

Publication Information

Hum Genet. 2025 Mar;144(2-3):209-225. doi: 10.1007/s00439-024-02695-w. Epub 2024 Aug 8.

Abstract

As emerging variant effect predictors, protein language models (pLMs) learn the evolutionary distribution of functional sequences to capture the fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structural context sequence-only pLMs learn and how it affects variant effect prediction, and we establish the need to inject protein structural context into pLMs purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs) that extends masked sequence denoising to cross-modality denoising over both sequence and structure. Numerical results on deep mutational scanning benchmarks show that our SI-pLMs, even with smaller models and less data, are robustly top performers against competing methods, including other pLMs, indicating that introducing biological context can be more effective at capturing the fitness landscape than simply using larger models or more data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can better capture the fitness landscape because (a) their learned embeddings of low- and high-fitness sequences can be more separable and (b) their learned amino-acid distributions at functionally and evolutionarily conserved residues can have much lower entropy, and thus be much more conserved, than those at other residues. Our SI-pLMs can be applied to revise any sequence-only pLM through model architecture and training objectives. They do not require structure data as model input for variant effect prediction; structures serve only as a context provider and model regularizer during training.
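
How a pLM's learned distribution yields a variant effect score is worth making concrete. Below is a minimal sketch of the widely used log-odds convention (score a substitution by log p(mutant) - log p(wild type) at the mutated position under the model's per-position amino-acid distribution), together with the per-residue entropy that the case studies examine. The abstract does not specify the paper's exact scoring protocol, and the `probs` array here is a random stand-in for real pLM output.

```python
# Hypothetical sketch: log-odds variant scoring and per-residue entropy
# from a pLM's per-position amino-acid distribution. `probs` is a random
# stand-in for real model output; function names are illustrative.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def variant_log_odds(probs: np.ndarray, pos: int, wt: str, mut: str) -> float:
    """Score a substitution as log p(mut) - log p(wt) at the mutated
    position, given `probs` of shape (sequence_length, 20)."""
    p = probs[pos]
    return float(np.log(p[AA_INDEX[mut]]) - np.log(p[AA_INDEX[wt]]))

def residue_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (nats) of each position's amino-acid distribution;
    lower entropy means the model treats the residue as more conserved."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

# Toy 120-residue protein with a random normalized distribution.
rng = np.random.default_rng(0)
probs = rng.random((120, 20))
probs /= probs.sum(axis=-1, keepdims=True)
print(variant_log_odds(probs, pos=41, wt="A", mut="V"))  # e.g. A42V (0-indexed 41)
print(residue_entropy(probs)[:5])
```

Under this convention, the abstract's finding (b) corresponds to `residue_entropy` being much lower at conserved positions for SI-pLMs than for sequence-only pLMs.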

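The framework's core mechanism, extending masked sequence denoising to cross-modality denoising while using structure only as a training-time context provider and regularizer, can be illustrated with a toy training step. This is a hedged sketch, not the paper's architecture: the shared Transformer encoder, the pairwise-distance head, the Cα distance-map features, and the loss weight `lam` are all assumptions made for illustration.

```python
# Illustrative sketch only: a joint sequence/structure denoising loss in
# the spirit of the abstract. Module names, feature choices, and loss
# weighting are assumptions, not the paper's actual design.
import torch
import torch.nn as nn

VOCAB, D_MODEL, MASK_ID = 21, 128, 20  # 20 amino acids + one mask token

class SIpLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.seq_head = nn.Linear(D_MODEL, VOCAB)          # masked-token logits
        self.dist_head = nn.Bilinear(D_MODEL, D_MODEL, 1)  # pairwise distances

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))               # (B, L, D)
        logits = self.seq_head(h)
        B, L, D = h.shape
        hi = h.unsqueeze(2).expand(B, L, L, D).reshape(-1, D)
        hj = h.unsqueeze(1).expand(B, L, L, D).reshape(-1, D)
        dist = self.dist_head(hi, hj).view(B, L, L)        # predicted map
        return logits, dist

def training_step(model, seq, dist_map, mask_frac=0.15, lam=0.5):
    """Cross-modality denoising: corrupt the sequence by masking, then
    reconstruct both the masked tokens and the clean distance map.
    Structure enters only through this loss, so inference on variants
    remains sequence-only."""
    tokens = seq.clone()
    mask = torch.rand(seq.shape) < mask_frac
    tokens[mask] = MASK_ID
    logits, pred_dist = model(tokens)
    seq_loss = nn.functional.cross_entropy(logits[mask], seq[mask])
    struct_loss = nn.functional.mse_loss(pred_dist, dist_map)
    return seq_loss + lam * struct_loss

model = SIpLMSketch()
seq = torch.randint(0, 20, (2, 50))      # toy batch: 2 sequences of length 50
dist_map = torch.rand(2, 50, 50) * 20.0  # toy Calpha distance maps (angstroms)
training_step(model, seq, dist_map).backward()
```

Noising the structure input itself and denoising it jointly with the sequence would be another instantiation of the same idea; either way, the structure branch acts only as a regularizer that shapes the sequence representations and is not needed at prediction time.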

Similar Articles

Teaching AI to speak protein.
Curr Opin Struct Biol. 2025 Apr;91:102986. doi: 10.1016/j.sbi.2025.102986. Epub 2025 Feb 21.

Do protein language models learn phylogeny?
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf047.

