Sun Yuanfei, Shen Yang
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, USA.
Hum Genet. 2025 Mar;144(2-3):209-225. doi: 10.1007/s00439-024-02695-w. Epub 2024 Aug 8.
Emerging as variant effect predictors, protein language models (pLMs) learn the evolutionary distribution of functional sequences to capture the fitness landscape. Considering that variant effects manifest through biological contexts beyond sequence (such as structure), we first assess how much structural context sequence-only pLMs learn and how it affects variant effect prediction, establishing the need to inject protein structural context into pLMs purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs) by extending masked sequence denoising to cross-modality denoising over both sequence and structure. Numerical results on deep mutational scanning benchmarks show that our SI-pLMs, even when using smaller models and less data, are robustly top performers against competing methods, including other pLMs, indicating that introducing biological context can be more effective at capturing the fitness landscape than simply using larger models or bigger data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can better capture the fitness landscape because (a) learned embeddings of low- and high-fitness sequences can be more separable and (b) learned amino-acid distributions at functionally and evolutionarily conserved residues can have much lower entropy, and thus be much more conserved, than those at other residues. Our SI-pLM framework is applicable to revising any sequence-only pLM through its model architecture and training objectives. SI-pLMs do not require structure data as model input for variant effect prediction; they use structures only as a context provider and model regularizer during training.
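The cross-modality denoising objective can be illustrated with a minimal sketch, shown below in PyTorch: masked residues and a clean inter-residue distance map are jointly reconstructed from the corrupted sequence, so structure enters the training loss but never the inference inputs. The architecture, corruption scheme, loss weighting, and all names here (SIPLMSketch, cross_modal_denoising_loss, lam) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MASK_ID, PAD_ID, VOCAB = 20, 21, 22  # 20 amino acids + [MASK] + [PAD] (illustrative)

class SIPLMSketch(nn.Module):
    """Minimal encoder with two denoising heads: a sequence head that
    reconstructs masked residues and a structure head that predicts an
    inter-residue distance map from the same representation."""
    def __init__(self, d=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d, padding_idx=PAD_ID)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.seq_head = nn.Linear(d, VOCAB)  # per-residue amino-acid logits
        self.pair_proj = nn.Linear(d, d)     # projection for distance prediction

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))         # (B, L, d)
        q = self.pair_proj(h)
        return self.seq_head(h), torch.cdist(q, q)   # (B, L, VOCAB), (B, L, L)

def cross_modal_denoising_loss(model, tokens, dist_map, mask_prob=0.15, lam=1.0):
    """One step of cross-modality denoising: mask residues, then jointly
    reconstruct the masked sequence and the clean distance map."""
    corrupt = tokens.clone()
    mask = (torch.rand(tokens.shape) < mask_prob) & (tokens != PAD_ID)
    corrupt[mask] = MASK_ID
    seq_logits, dist_pred = model(corrupt)
    seq_loss = F.cross_entropy(seq_logits[mask], tokens[mask])  # masked positions only
    struct_loss = F.mse_loss(dist_pred, dist_map)               # structure as regularizer
    return seq_loss + lam * struct_loss

# Toy usage with random data:
model = SIPLMSketch()
tokens = torch.randint(0, 20, (2, 64))     # two length-64 sequences
dist_map = torch.rand(2, 64, 64) * 20.0    # stand-in Calpha distance maps
cross_modal_denoising_loss(model, tokens, dist_map).backward()
```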
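At inference, a trained SI-pLM scores variants from sequence alone. Building on the sketch above, the hedged example below shows two analyses implied by the abstract: a masked-marginal log-odds score for a substitution (a common convention for pLM-based variant effect prediction; whether the paper uses exactly this rule is an assumption) and the per-residue entropy of the predicted amino-acid distribution used to gauge conservation, as in case study (b).

```python
@torch.no_grad()
def score_variant(model, tokens, pos, wt_id, mut_id):
    """Masked-marginal log-odds: mask the mutated site, then compare
    log p(mutant) with log p(wild type). No structure input is needed;
    the structure head's output is simply ignored."""
    masked = tokens.clone()
    masked[0, pos] = MASK_ID
    logits, _ = model(masked)
    logp = F.log_softmax(logits[0, pos], dim=-1)
    return (logp[mut_id] - logp[wt_id]).item()

@torch.no_grad()
def per_residue_entropy(model, tokens):
    """Shannon entropy of the learned amino-acid distribution at each
    position; lower entropy marks a more conserved residue."""
    logits, _ = model(tokens)
    p = F.softmax(logits[0, :, :20], dim=-1)           # restrict to 20 amino acids
    return -(p * p.clamp_min(1e-9).log()).sum(dim=-1)  # (L,)

# e.g. score a substitution to residue type 17 at position 10 of the first toy sequence
print(score_variant(model, tokens, pos=10, wt_id=int(tokens[0, 10]), mut_id=17))
print(per_residue_entropy(model, tokens)[:5])
```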