Nagano Yuta, Pyo Andrew G T, Milighetti Martina, Henderson James, Shawe-Taylor John, Chain Benny, Tiffeau-Mayer Andreas
Division of Infection and Immunity, University College London, London WC1E 6BT, UK; Division of Medicine, University College London, London WC1E 6BT, UK.
Center for the Physics of Biological Function, Princeton University, Princeton, NJ 08544, USA.
Cell Syst. 2025 Jan 15;16(1):101165. doi: 10.1016/j.cels.2024.12.006. Epub 2025 Jan 7.
Computational prediction of the interaction of T cell receptors (TCRs) and their ligands is a grand challenge in immunology. Despite advances in high-throughput assays, specificity-labeled TCR data remain sparse. In other domains, pre-training of language models on unlabeled data has been used successfully to overcome such data bottlenecks. However, it is unclear how best to pre-train protein language models for TCR specificity prediction. Here, we introduce a TCR language model called SCEPTR (simple contrastive embedding of the primary sequence of T cell receptors), which is capable of data-efficient transfer learning. With SCEPTR, we introduce a pre-training strategy combining autocontrastive learning and masked-language modeling, which enables the model to achieve state-of-the-art performance. In contrast, existing protein language models and a variant of SCEPTR pre-trained without autocontrastive learning are outperformed by sequence alignment-based methods. We anticipate that contrastive learning will be a useful paradigm for decoding the rules of TCR specificity. A record of this paper's transparent peer review process is included in the supplemental information.
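To make the pre-training strategy named in the abstract concrete, the following is a minimal PyTorch sketch of an autocontrastive InfoNCE objective combined with a masked-language-modeling (MLM) loss. This is not the authors' implementation: the use of two dropout-perturbed forward passes as positive pairs (SimCSE-style), the temperature value, the loss weights, and the model interface are all illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def autocontrastive_loss(z1, z2, temperature=0.05):
        # SimCSE-style InfoNCE: z1[i] and z2[i] embed the same TCR sequence
        # under two stochastic (dropout-perturbed) forward passes; all other
        # pairs in the batch serve as in-batch negatives.
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        logits = z1 @ z2.T / temperature          # (batch, batch) cosine similarities
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)   # diagonal entries are the positives

    def pretraining_loss(model, masked_tokens, mlm_labels, acl_weight=1.0, mlm_weight=1.0):
        # Hypothetical combined objective: model(...) is assumed to return a
        # pooled sequence embedding and per-token MLM logits of shape
        # (batch, length, vocab); mlm_labels uses -100 at unmasked positions.
        z1, mlm_logits = model(masked_tokens)     # first stochastic pass
        z2, _ = model(masked_tokens)              # second pass; dropout noise differs
        acl = autocontrastive_loss(z1, z2)
        mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels, ignore_index=-100)
        return acl_weight * acl + mlm_weight * mlm

The intuition behind such an objective is that InfoNCE pulls the two views of each sequence together while pushing apart unrelated sequences in the batch, shaping a pooled embedding that can be used directly for alignment-free TCR similarity comparisons, while the MLM term preserves token-level sequence information.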