Structure-informed protein language models are robust predictors for variant effects.

Authors

Sun Yuanfei, Shen Yang

Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA.

Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA.

Publication Information

Hum Genet. 2025 Mar;144(2-3):209-225. doi: 10.1007/s00439-024-02695-w. Epub 2024 Aug 8.

Abstract

As emerging variant effect predictors, protein language models (pLMs) learn the evolutionary distribution of functional sequences to capture the fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structural context sequence-only pLMs learn and how it affects variant effect prediction, and we establish the need to inject protein structural context into pLMs purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs) that extends masked sequence denoising to cross-modality denoising over both sequence and structure. Numerical results on deep mutational scanning benchmarks show that our SI-pLMs, even with smaller models and less data, are robustly top performers against competing methods, including other pLMs, indicating that introducing biological context can be more effective at capturing the fitness landscape than simply using larger models or more data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can better capture the fitness landscape because (a) their learned embeddings of low- and high-fitness sequences can be more separable and (b) their learned amino-acid distributions at functionally and evolutionarily conserved residues can have much lower entropy, and thus be much more conserved, than those at other residues. Our SI-pLMs can be applied to revise any sequence-only pLM through model architecture and training objectives. They do not require structure data as model input for variant effect prediction; structures serve only as a context provider and model regularizer during training.
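
How a pLM's learned distribution yields a variant effect score is worth making concrete. Below is a minimal sketch of the widely used log-odds convention (score a substitution by log p(mutant) - log p(wild type) at the mutated position under the model's per-position amino-acid distribution), together with the per-residue entropy that the case studies examine. The abstract does not specify the paper's exact scoring protocol, and the `probs` array here is a random stand-in for real pLM output.

```python
# Hypothetical sketch: log-odds variant scoring and per-residue entropy
# from a pLM's per-position amino-acid distribution. `probs` is a random
# stand-in for real model output; function names are illustrative.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def variant_log_odds(probs: np.ndarray, pos: int, wt: str, mut: str) -> float:
    """Score a substitution as log p(mut) - log p(wt) at the mutated
    position, given `probs` of shape (sequence_length, 20)."""
    p = probs[pos]
    return float(np.log(p[AA_INDEX[mut]]) - np.log(p[AA_INDEX[wt]]))

def residue_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (nats) of each position's amino-acid distribution;
    lower entropy means the model treats the residue as more conserved."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

# Toy 120-residue protein with a random normalized distribution.
rng = np.random.default_rng(0)
probs = rng.random((120, 20))
probs /= probs.sum(axis=-1, keepdims=True)
print(variant_log_odds(probs, pos=41, wt="A", mut="V"))  # e.g. A42V (0-indexed 41)
print(residue_entropy(probs)[:5])
```

Under this convention, the abstract's finding (b) corresponds to `residue_entropy` being much lower at conserved positions for SI-pLMs than for sequence-only pLMs.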

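The framework's core mechanism, extending masked sequence denoising to cross-modality denoising while using structure only as a training-time context provider and regularizer, can be illustrated with a toy training step. This is a hedged sketch, not the paper's architecture: the shared Transformer encoder, the pairwise-distance head, the Cα distance-map features, and the loss weight `lam` are all assumptions made for illustration.

```python
# Illustrative sketch only: a joint sequence/structure denoising loss in
# the spirit of the abstract. Module names, feature choices, and loss
# weighting are assumptions, not the paper's actual design.
import torch
import torch.nn as nn

VOCAB, D_MODEL, MASK_ID = 21, 128, 20  # 20 amino acids + one mask token

class SIpLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.seq_head = nn.Linear(D_MODEL, VOCAB)          # masked-token logits
        self.dist_head = nn.Bilinear(D_MODEL, D_MODEL, 1)  # pairwise distances

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))               # (B, L, D)
        logits = self.seq_head(h)
        B, L, D = h.shape
        hi = h.unsqueeze(2).expand(B, L, L, D).reshape(-1, D)
        hj = h.unsqueeze(1).expand(B, L, L, D).reshape(-1, D)
        dist = self.dist_head(hi, hj).view(B, L, L)        # predicted map
        return logits, dist

def training_step(model, seq, dist_map, mask_frac=0.15, lam=0.5):
    """Cross-modality denoising: corrupt the sequence by masking, then
    reconstruct both the masked tokens and the clean distance map.
    Structure enters only through this loss, so inference on variants
    remains sequence-only."""
    tokens = seq.clone()
    mask = torch.rand(seq.shape) < mask_frac
    tokens[mask] = MASK_ID
    logits, pred_dist = model(tokens)
    seq_loss = nn.functional.cross_entropy(logits[mask], seq[mask])
    struct_loss = nn.functional.mse_loss(pred_dist, dist_map)
    return seq_loss + lam * struct_loss

model = SIpLMSketch()
seq = torch.randint(0, 20, (2, 50))      # toy batch: 2 sequences of length 50
dist_map = torch.rand(2, 50, 50) * 20.0  # toy Calpha distance maps (angstroms)
training_step(model, seq, dist_map).backward()
```

Noising the structure input itself and denoising it jointly with the sequence would be another instantiation of the same idea; either way, the structure branch acts only as a regularizer that shapes the sequence representations and is not needed at prediction time.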

Similar Articles

Teaching AI to speak protein.
Curr Opin Struct Biol. 2025 Apr;91:102986. doi: 10.1016/j.sbi.2025.102986. Epub 2025 Feb 21.

Do protein language models learn phylogeny?
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf047.

