Self-supervised learning of T cell receptor sequences exposes core properties for T cell membership.

Affiliation

The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel.

Publication

Sci Adv. 2024 Apr 26;10(17):eadk4670. doi: 10.1126/sciadv.adk4670.

Abstract

The T cell receptor (TCR) repertoire is an extraordinarily diverse collection of TCRs essential for maintaining the body's homeostasis and response to threats. In this study, we compiled an extensive dataset of more than 4200 bulk TCR repertoire samples, encompassing 221,176,713 sequences, alongside 6,159,652 single-cell TCR sequences from over 400 samples. From this dataset, we then selected a representative subset of 5 million bulk sequences and 4.2 million single-cell sequences to train two specialized Transformer-based language models for bulk (CVC) and single-cell (scCVC) TCR repertoires, respectively. We show that these models successfully capture TCR core qualities, such as sharing, gene composition, and single-cell properties. These qualities are emergent in the encoded TCR latent space and enable classification into TCR-based qualities such as public sequences. These models demonstrate the potential of Transformer-based language models in TCR downstream applications.
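The abstract describes, at a high level, self-supervised Transformer training on TCR sequences followed by downstream classification in the learned latent space. The sketch below illustrates that general recipe only; it is not the authors' CVC/scCVC implementation. It assumes PyTorch, uses hypothetical names (TCREncoder, embed_sequence, public_head), toy hyperparameters, and two example CDR3 strings, and shows masked-residue self-supervision over amino-acid tokens followed by mean-pooled embeddings feeding a small classifier head (e.g., public vs. private sequences).

# Illustrative sketch only: a small Transformer encoder trained with masked-token
# self-supervision on TCR CDR3 amino-acid sequences, then used to embed sequences
# for a downstream classifier. All names and hyperparameters are hypothetical;
# this is not the authors' CVC/scCVC code.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 0, 1                      # special token ids
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB_SIZE = len(VOCAB) + 2
MAX_LEN = 30                          # CDR3 loops are short; 30 is a generous cap


def encode(cdr3: str) -> torch.Tensor:
    """Map a CDR3 amino-acid string to a fixed-length id tensor (padded)."""
    ids = [VOCAB[aa] for aa in cdr3][:MAX_LEN]
    ids += [PAD] * (MAX_LEN - len(ids))
    return torch.tensor(ids)


class TCREncoder(nn.Module):
    """Tiny Transformer encoder with a masked-token prediction head."""

    def __init__(self, d_model: int = 64, nhead: int = 4, nlayers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
        self.pos = nn.Parameter(torch.zeros(MAX_LEN, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.mlm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(ids) + self.pos
        return self.encoder(x, src_key_padding_mask=(ids == PAD))

    def embed_sequence(self, ids: torch.Tensor) -> torch.Tensor:
        """Mean-pool token representations into one vector per sequence."""
        h = self.forward(ids)
        mask = (ids != PAD).unsqueeze(-1)
        return (h * mask).sum(1) / mask.sum(1)


# Self-supervised step: randomly mask residues and predict them.
model = TCREncoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.stack([encode("CASSLGQAYEQYF"), encode("CASSPGTGELFF")])
masked = batch.clone()
mask_pos = (torch.rand(batch.shape) < 0.15) & (batch != PAD)
mask_pos[:, 0] = True                 # guarantee at least one masked position
masked[mask_pos] = MASK
logits = model.mlm_head(model(masked))
loss = nn.functional.cross_entropy(logits[mask_pos], batch[mask_pos])
loss.backward()
optim.step()

# Downstream use: pooled embeddings feed a simple classifier head
# (e.g., public vs. private sequences).
with torch.no_grad():
    features = model.embed_sequence(batch)    # shape: (2, 64)
public_head = nn.Linear(features.shape[1], 2)
print(public_head(features).shape)            # torch.Size([2, 2])

The masked-token objective above merely stands in for whichever self-supervised objective the authors used; the point it illustrates is that downstream classifiers consume the encoder's pooled latent representation rather than the raw sequence.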

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6b3/11809652/d5b7686c0c7d/sciadv.adk4670-f1.jpg
