

Structure-Informed Protein Language Models are Robust Predictors for Variant Effects.

Author Information

Sun Yuanfei, Shen Yang

Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA.

Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA.

Publication Information

Res Sq. 2023 Aug 3:rs.3.rs-3219092. doi: 10.21203/rs.3.rs-3219092/v1.

Abstract

Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) have emerged as zero-shot predictors that require no effect labels, by modeling the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are implicitly modeled and effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that their improvements often correlate with better variant effect prediction, but the tradeoff between the two can present a barrier, as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) to inject protein structural contexts purposely and controllably, by extending the masked sequence denoising in conventional pLMs to cross-modality denoising. Our SI-pLMs are applicable to revising any sequence-only pLMs through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and use structures only as a context provider and model regularizer during training. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods including other pLMs, regardless of the target protein family's evolutionary information content or its tendency toward overfitting / over-finetuning. Learned distributions in structural contexts can enhance sequence distributions in predicting variant effects. Ablation studies reveal the major contributing factors, and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git.
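To make the zero-shot setting concrete: pLMs typically score a substitution by comparing the model's probabilities for the mutant and wild-type residues at a masked position. The sketch below is a toy illustration of that masked-marginal scoring scheme, not the paper's code; the pLM is stubbed out as a fixed probability table over amino acids at one position, whereas a real (SI-)pLM would produce these probabilities from the full sequence context.

```python
import math

def masked_marginal_score(p_masked, wildtype_aa, mutant_aa):
    """Score a substitution as log p(mutant) - log p(wildtype) at the
    masked position; higher (less negative) scores predict more
    tolerated variants, lower scores predict more deleterious ones."""
    return math.log(p_masked[mutant_aa]) - math.log(p_masked[wildtype_aa])

# Hypothetical pLM output at one masked position of a sequence
# (remaining probability mass over other residues omitted for brevity).
p_masked = {"A": 0.50, "G": 0.30, "W": 0.05, "P": 0.01}

score_conservative = masked_marginal_score(p_masked, "A", "G")  # A -> G
score_disruptive = masked_marginal_score(p_masked, "A", "P")    # A -> P
```

Under this toy table, the conservative A→G substitution scores higher than the disruptive A→P one, which is the qualitative behavior a pLM's evolutionary distribution is expected to capture; correlating such scores with deep mutational scanning measurements is the standard benchmark setup the abstract refers to.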


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/50c1/10418537/5c15b891f8fc/nihpp-rs3219092v1-f0006.jpg
