Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China.
General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China.
Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbab595.
Predicting the binding between peptides and the major histocompatibility complex (MHC) plays a vital role in cancer immunotherapy. The success of AlphaFold in applying natural language processing (NLP) algorithms to protein secondary structure prediction inspired us to explore NLP methods for predicting peptide-MHC class I binding. Motivated by this, we propose MHCRoBERTa, a method based on the RoBERTa pre-training approach, for predicting the binding affinity between class I MHC and peptides. Results on a benchmark dataset show that MHCRoBERTa outperforms other state-of-the-art prediction methods in terms of the Spearman rank correlation coefficient (SRCC). Notably, our model gives a significant improvement on IC50 values. Our method achieves an SRCC of 0.785 and an AUC of 0.817. Its SRCC is 14.3% higher than that of NetMHCpan3.0 (the second-highest SRCC among pan-specific methods) and 3% higher than that of MHCflurry (the second-highest SRCC among all methods). Its AUC is also better than that of any other pan-specific method. Moreover, we visualize the multi-head self-attention over the token representations across layers and heads. By analyzing the representation of each layer and head, we can show whether the model has learned the syntax and semantics needed to perform the prediction task well. Together, these results demonstrate that our model can accurately predict peptide-MHC class I binding affinity and that MHCRoBERTa is a powerful tool for screening potential neoantigens for cancer immunotherapy. MHCRoBERTa is available as open-source software on GitHub (https://github.com/FuxuWang/MHCRoBERTa).
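The abstract reports SRCC and AUC without showing how such scores are obtained. The following is a minimal sketch of both metrics in plain Python, not the authors' evaluation code. The function names are hypothetical, the IC50 values are invented toy numbers, and the 500 nM cutoff used to label binders is a common convention assumed here; the abstract does not state one.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ascending ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive scores higher than a randomly chosen negative."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, lab in zip(ranks, labels) if lab)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data (invented for illustration): IC50 in nM, lower means stronger binding.
measured_ic50 = [20.0, 150.0, 800.0, 5000.0, 30000.0]
predicted_ic50 = [35.0, 600.0, 400.0, 3000.0, 45000.0]

# Assumed convention: a peptide counts as a binder if measured IC50 < 500 nM.
is_binder = [1 if v < 500.0 else 0 for v in measured_ic50]

srcc = spearman(measured_ic50, predicted_ic50)
# Negate predictions so that a higher score means a stronger predicted binder.
auc = roc_auc(is_binder, [-v for v in predicted_ic50])

print(round(srcc, 3))  # 0.9
print(round(auc, 3))   # 0.833
```

SRCC depends only on the ordering of predictions, so it is unchanged by any monotonic rescaling of IC50 (for example, a log transform), whereas AUC additionally depends on the chosen binder threshold.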