Assessing the generalization capabilities of TCR binding predictors via peptide distance analysis.

Author information

Castorina Leonardo V, Grazioli Filippo, Machart Pierre, Mösch Anja, Errica Federico

Affiliations

School of Informatics, University of Edinburgh, Edinburgh, United Kingdom.

NEC Laboratories Europe, Heidelberg, Germany.

Publication information

PLoS One. 2025 May 20;20(5):e0324011. doi: 10.1371/journal.pone.0324011. eCollection 2025.

Abstract

Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. While recent machine learning (ML) models show remarkable success in predicting TCR-pMHC binding within training data, these models often fail to generalize to peptides outside their training distributions, raising concerns about their applicability in therapeutic settings. Understanding and improving the generalization of these models is therefore critical to ensure real-world applications. To address this issue, we evaluate the effect of the distance between training and testing peptide distributions on ML model empirical risk assessments, using sequence-based and 3D structure-based distance metrics. In our analysis, we use several state-of-the-art models for TCR-peptide binding prediction: Attentive Variational Information Bottleneck (AVIB), NetTCR-2.0 and -2.2, and ERGO II (pre-trained autoencoder) and ERGO II (LSTM). In this work, we introduce a novel approach for assessing the generalization capabilities of TCR binding predictors: the Distance Split (DS) algorithm. The DS algorithm controls the distance between training and testing peptides based on both sequence and structure, allowing for a more nuanced evaluation of model performance. We show that lower 3D shape similarity between training and test peptides is associated with a harder out-of-distribution task definition, which is more interesting when measuring the ability to generalize to unseen peptides. However, we observe the opposite effect when splitting using sequence-based similarity. These findings highlight the importance of using a distance-based splitting approach to benchmark models. This could then be used to estimate a confidence score on predictions on novel and unseen peptides, based on how different they are from the training ones. Additionally, our results may hint that employing 3D shape to complement sequence information could improve the accuracy of TCR-pMHC binding predictors.
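The abstract does not spell out how the DS algorithm works, so the following is only a minimal sketch of the general idea of a distance-controlled train/test split, not the authors' method. It assumes Levenshtein edit distance as a stand-in for the sequence-based metric, and the names `levenshtein`, `distance_split`, and `train_test_gap`, as well as the toy peptide strings, are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def distance_split(peptides, n_test, farthest=True):
    """Hypothetical split: rank peptides by mean distance to all others.

    With farthest=True the n_test most distant peptides become the test
    set; with farthest=False the n_test most central ones do.
    Assumes at least two unique peptides.
    """
    unique = sorted(set(peptides))
    mean_dist = {
        p: sum(levenshtein(p, q) for q in unique if q != p) / (len(unique) - 1)
        for p in unique
    }
    ranked = sorted(unique, key=lambda p: (mean_dist[p], p), reverse=farthest)
    return ranked[n_test:], ranked[:n_test]  # (train, test)


def train_test_gap(train, test):
    """Mean distance from each test peptide to its nearest training peptide."""
    return sum(min(levenshtein(t, r) for r in train) for t in test) / len(test)


# Toy sequences, not real epitopes.
toy = ["AAAA", "AAAB", "AABB", "CCCC"]
far_train, far_test = distance_split(toy, n_test=1, farthest=True)
near_train, near_test = distance_split(toy, n_test=1, farthest=False)
print(far_test, train_test_gap(far_train, far_test))
print(near_test, train_test_gap(near_train, near_test))
```

Evaluating the same model on both splits and comparing its scores against the reported train/test gap is one way to see how performance varies with distance. A structure-based variant would swap `levenshtein` for a 3D shape distance, which this sketch does not attempt.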

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cbd0/12091837/1cb1322669b7/pone.0324011.g001.jpg
