The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel.
Sci Adv. 2024 Apr 26;10(17):eadk4670. doi: 10.1126/sciadv.adk4670.
The T cell receptor (TCR) repertoire is an extraordinarily diverse collection of TCRs essential for maintaining the body's homeostasis and response to threats. In this study, we compiled an extensive dataset of more than 4200 bulk TCR repertoire samples, encompassing 221,176,713 sequences, alongside 6,159,652 single-cell TCR sequences from over 400 samples. From this dataset, we then selected a representative subset of 5 million bulk sequences and 4.2 million single-cell sequences to train two specialized Transformer-based language models for bulk (CVC) and single-cell (scCVC) TCR repertoires, respectively. We show that these models successfully capture core TCR qualities, such as sharing across individuals, gene composition, and single-cell properties. These qualities emerge in the encoded TCR latent space and enable classification of sequences by TCR-based properties, such as whether a sequence is public. These models demonstrate the potential of Transformer-based language models in TCR downstream applications.
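To make the modeling setup concrete, the sketch below shows how TCR CDR3 amino-acid sequences are typically tokenized before being fed to a BERT-style Transformer language model. This is an illustrative assumption about the preprocessing, not the authors' actual CVC/scCVC implementation; the token names, vocabulary layout, and `max_len` value are hypothetical.

```python
# Illustrative sketch (not the paper's CVC code): character-level
# tokenization of TCR CDR3 sequences for a BERT-style language model.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]  # assumed special tokens

# Vocabulary: special tokens first, then one token per residue.
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def encode_cdr3(seq: str, max_len: int = 24) -> list[int]:
    """Encode one CDR3 as [CLS] + residue tokens + [SEP], padded to max_len."""
    ids = [VOCAB["[CLS]"]] + [VOCAB[aa] for aa in seq] + [VOCAB["[SEP]"]]
    ids += [VOCAB["[PAD]"]] * (max_len - len(ids))
    return ids[:max_len]

# Example: a common public CDR3-beta sequence.
ids = encode_cdr3("CASSLGTDTQYF")
```

A masked-language-model objective would then hide a fraction of the residue tokens (replacing them with `[MASK]`) and train the Transformer to recover them; the `[CLS]` position's hidden state can serve as the per-sequence embedding in the latent space the abstract describes.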