UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, TN, United States.
Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, TN, United States.
Front Immunol. 2024 Aug 16;15:1426173. doi: 10.3389/fimmu.2024.1426173. eCollection 2024.
Artificial-intelligence and machine-learning (AI/ML) approaches to predicting T-cell receptor (TCR)-epitope specificity achieve high performance metrics on test datasets that include sequences also present in the training set, but fail to generalize to test sets consisting of epitopes and TCRs that are absent from the training set, i.e., 'unseen' during training of the ML model. We present TCR-H, a supervised support vector machine (SVM) classification model that uses physicochemical features and is trained on the largest dataset available to date, using only experimentally validated non-binders as negative datapoints. TCR-H exhibits an area under the receiver operating characteristic curve (ROC AUC) of 0.87 for epitope 'hard splitting' (i.e., on test sets in which all epitopes are unseen during ML training), 0.92 for TCR hard splitting, and 0.89 for 'strict splitting', in which neither the epitopes nor the TCRs in the test set are seen in the training data. Furthermore, we employ the SHAP (SHapley Additive exPlanations) eXplainable AI (XAI) method to interpret the models trained with different hard splits, shedding light on the key physicochemical features driving model predictions. TCR-H thus represents a significant step towards general applicability and explainability of epitope:TCR specificity prediction.
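The difference between the splitting schemes named above can be illustrated with a minimal Python sketch. It is not the TCR-H implementation: the abstract does not describe the splitting procedure, so the function names (`hard_split`, `strict_split`), the `test_fraction` and `seed` parameters, the placeholder identifiers, and the decision to discard mixed pairs in the strict split are all assumptions made for illustration.

```python
import random

# Toy data: every pairing of 5 placeholder TCRs with 5 placeholder epitopes.
# Each record is (tcr, epitope); identifiers are hypothetical, not real sequences.
pairs = [(f"TCR_{t}", f"EPI_{e}") for t in "ABCDE" for e in "12345"]

def _held_out(values, test_fraction, rng):
    """Pick a random subset of unique values to reserve for the test set."""
    values = sorted(values)          # sort first so the shuffle is reproducible
    rng.shuffle(values)
    n_test = max(1, round(len(values) * test_fraction))
    return set(values[:n_test])

def hard_split(pairs, field, test_fraction=0.2, seed=0):
    """Hard split on one field (0 = TCR, 1 = epitope).

    Every pair whose value in `field` is held out goes to the test set,
    so no test value of that field is ever seen in training.
    """
    rng = random.Random(seed)
    held = _held_out({p[field] for p in pairs}, test_fraction, rng)
    train = [p for p in pairs if p[field] not in held]
    test = [p for p in pairs if p[field] in held]
    return train, test

def strict_split(pairs, test_fraction=0.2, seed=0):
    """Strict split: test pairs use only held-out TCRs AND held-out epitopes.

    Pairs that mix a held-out member with a training member are discarded
    (an assumption of this sketch), so neither side leaks into training.
    """
    rng = random.Random(seed)
    test_tcrs = _held_out({p[0] for p in pairs}, test_fraction, rng)
    test_epis = _held_out({p[1] for p in pairs}, test_fraction, rng)
    train = [p for p in pairs if p[0] not in test_tcrs and p[1] not in test_epis]
    test = [p for p in pairs if p[0] in test_tcrs and p[1] in test_epis]
    return train, test

ep_train, ep_test = hard_split(pairs, field=1)    # epitope hard split
tcr_train, tcr_test = hard_split(pairs, field=0)  # TCR hard split
st_train, st_test = strict_split(pairs)           # strict split
```

In the epitope hard split the training set may still contain TCRs that reappear in the test set, and the reverse holds for the TCR hard split. Only the strict split removes both kinds of overlap, at the cost of discarding the mixed pairs.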