Le Phi, Ung Leah, Yang Hai, Huang Anwen, He Tao, Bruno Peter, Oh David Y, Keenan Bridget P, Zhang Li
Department of Medicine, University of California San Francisco, 550 16th Street, San Francisco, CA 94158, United States.
Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, 1450 3rd St., San Francisco, CA 94158, United States.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf351.
Predicting which antigen peptides a T-cell receptor (TCR) recognizes is crucial for understanding the immune system and for developing new treatments for cancer, infectious diseases, and autoimmune diseases. Because experimental methods for identifying TCR-antigen recognition are expensive and time-consuming, machine-learning approaches are increasingly used. However, existing computational tools often generalize poorly because data are limited and true non-recognition pairs are difficult to obtain, and they rarely integrate multiple biological features into a unified framework. To address these challenges, we propose a two-step framework for predicting TCR-antigen recognition. The first step focuses on feature engineering: neural network-based embeddings of letter-based TCR and peptide sequences, inspired by language models, and categorical encoding of Human Leukocyte Antigen (HLA) types and Variable/Joining (V/J) genes. In the second step, we build a prediction model that uses a Bayesian feedforward neural network to assess the likelihood of TCR-antigen recognition. We trained and validated the framework using large public databases. Our results demonstrate that this feature engineering delivers strong predictive performance in both internal and external validation. We applied the framework to a real-world case, predicting whether specific TCRs can recognize SARS-CoV-2 epitope peptides, which demonstrates that it can function as a de novo TCR-antigen prediction tool applicable to infectious diseases.
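The two-step structure described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the embedding model, layer sizes, priors, or inference method, so everything below is an assumption made for illustration: a random per-residue lookup table with mean pooling stands in for the learned sequence embeddings, one-hot vectors stand in for the categorical encoding, and the Bayesian network is approximated by sampling weights from independent Gaussians around untrained random means. All function names, vocabularies, and input values are hypothetical, and the output has no biological meaning.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed_sequence(seq, table):
    """Mean-pool per-residue vectors (assumed stand-in for a learned embedding)."""
    idx = [AMINO_ACIDS.index(aa) for aa in seq]
    return table[idx].mean(axis=0)

def one_hot(value, vocabulary):
    """Categorical encoding of a single HLA type or V/J gene."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def build_features(cdr3, peptide, hla, v_gene, j_gene, table, vocab):
    """Step 1: concatenate sequence embeddings and categorical encodings."""
    return np.concatenate([
        embed_sequence(cdr3, table),
        embed_sequence(peptide, table),
        one_hot(hla, vocab["hla"]),
        one_hot(v_gene, vocab["v"]),
        one_hot(j_gene, vocab["j"]),
    ])

def init_params(sizes, rng, sigma=0.1):
    """Gaussian mean/std for each layer's weights and biases (untrained, assumed)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        mu_w = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        params.append((mu_w, np.full((n_in, n_out), sigma),
                       np.zeros(n_out), np.full(n_out, sigma)))
    return params

def bayesian_predict(x, params, rng, n_samples=200):
    """Step 2: Monte Carlo estimate of the predictive probability.

    Each pass samples a full set of weights, runs a ReLU feedforward
    network, and applies a sigmoid; the mean and spread across passes
    summarize the prediction and its uncertainty.
    """
    probs = np.empty(n_samples)
    for s in range(n_samples):
        h = x
        for i, (mu_w, sd_w, mu_b, sd_b) in enumerate(params):
            h = h @ rng.normal(mu_w, sd_w) + rng.normal(mu_b, sd_b)
            if i < len(params) - 1:
                h = np.maximum(h, 0.0)  # ReLU on hidden layers only
        probs[s] = 1.0 / (1.0 + np.exp(-h[0]))
    return probs.mean(), probs.std()

# Illustrative inputs only; vocabularies and sequences are hypothetical choices.
rng = np.random.default_rng(0)
table = rng.normal(size=(len(AMINO_ACIDS), 8))
vocab = {
    "hla": ["HLA-A*02:01", "HLA-B*07:02"],
    "v": ["TRBV7-9", "TRBV20-1"],
    "j": ["TRBJ2-1", "TRBJ2-7"],
}
x = build_features("CASSIRSSYEQYF", "YLQPRTFLL",
                   "HLA-A*02:01", "TRBV7-9", "TRBJ2-1", table, vocab)
params = init_params([x.size, 16, 1], rng)
mean_p, std_p = bayesian_predict(x, params, rng)
print(f"predicted recognition probability: {mean_p:.3f} +/- {std_p:.3f}")
```

In a trained model, the weight means and standard deviations would be learned from labeled TCR-antigen pairs rather than drawn at random; the sampling loop is what distinguishes this from a standard feedforward classifier, since it yields a spread of predictions instead of a single point estimate.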