Omelchenko Alisa A, Siwek Jane C, Chhibbar Prabal, Arshad Sanya, Nazarali Iliyan, Nazarali Kiran, Rosengart AnnaElaine, Rahimikollu Javad, Tilstra Jeremy, Shlomchik Mark J, Koes David R, Joglekar Alok V, Das Jishnu
Center for Systems Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA.
Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA.
bioRxiv. 2024 May 4:2024.05.01.592062. doi: 10.1101/2024.05.01.592062.
The explosion of sequence data has enabled the rapid growth of protein language models (pLMs). pLMs are now used in many frameworks, including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), the corresponding sequences are either co-embedded and then integrated post hoc, or concatenated prior to embedding. Notably, no method uses a language representation of the interaction itself. We developed an interaction LM (iLM), which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is the input to an LM, which is followed by a supervised prediction step in which the LM's representations are used as features. SWING was first applied to predicting peptide:MHC (pMHC) interactions. SWING not only produced Class I and Class II models whose predictive performance is comparable to state-of-the-art approaches, but its unique Mixed Class model also jointly predicted both classes. Further, the SWING model trained only on Class I alleles was predictive for Class II, a complex prediction task not attempted by any existing approach. For de novo data, using only Class I or Class II data, SWING also accurately predicted Class II pMHC interactions in murine models of systemic lupus erythematosus (SLE; MRL/lpr model) and type 1 diabetes (T1D; NOD model), and these predictions were validated experimentally. To further evaluate SWING's generalizability, we tested its ability to predict the disruption of specific protein-protein interactions by missense mutations. Although modern methods such as AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they cannot predict interaction-specific disruptions. SWING accurately predicted the impact of both Mendelian mutations and population variants on PPIs. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations using only sequence information. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.
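The abstract does not spell out how the interaction vocabulary is built, so the following is only a minimal sketch of one plausible reading of "sliding window" plus "differences in amino acid properties". It assumes a single property (the Kyte-Doolittle hydropathy scale), encodes each aligned residue pair as the rounded absolute difference in that property, and splits the resulting string into overlapping k-mers that serve as "words". The function names `interaction_string` and `to_words`, the choice of property, the rounding rule, and the k-mer size are all illustrative assumptions, not details taken from the paper.

```python
# Kyte-Doolittle hydropathy values, stored in tenths (e.g. 18 means 1.8) so
# that all arithmetic stays in exact integers. Using this particular scale is
# an assumption made for illustration only.
KD_TENTHS = {
    "A": 18, "R": -45, "N": -35, "D": -35, "C": 25,
    "Q": -35, "E": -35, "G": -4, "H": -32, "I": 45,
    "L": 38, "K": -39, "M": 19, "F": 28, "P": -16,
    "S": -8, "T": -7, "W": -9, "Y": -13, "V": 42,
}


def interaction_string(window_seq: str, target_seq: str) -> str:
    """Slide window_seq along target_seq and encode property differences.

    At every offset, each aligned residue pair contributes one digit: the
    absolute hydropathy difference rounded half up to the nearest integer.
    The largest possible difference on this scale is 9.0, so every pair maps
    to a single character in "0".."9".
    """
    n, m = len(window_seq), len(target_seq)
    if n == 0 or n > m:
        raise ValueError("window_seq must be non-empty and no longer than target_seq")
    digits = []
    for start in range(m - n + 1):
        segment = target_seq[start:start + n]
        for a, b in zip(window_seq, segment):
            diff_tenths = abs(KD_TENTHS[a] - KD_TENTHS[b])
            digits.append(str((diff_tenths + 5) // 10))  # round half up
    return "".join(digits)


def to_words(encoded: str, k: int) -> list[str]:
    """Split an encoded interaction string into overlapping k-mer 'words'."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [encoded[i:i + k] for i in range(len(encoded) - k + 1)]


if __name__ == "__main__":
    # Toy sequences chosen only to show the mechanics, not real pMHC data.
    encoded = interaction_string("AL", "ALA")
    print(encoded)
    print(to_words(encoded, 3))
```

Under these assumptions, the list of words for each interacting pair would play the role of a "document" passed to a language model, and the resulting embedding would be used as the feature vector for a supervised classifier. Those two steps depend on modeling choices the abstract does not describe, so they are left out of the sketch.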