Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel.
Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel.
Front Immunol. 2021 Jul 22;12:680687. doi: 10.3389/fimmu.2021.680687. eCollection 2021.
The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). The huge immune repertoire in each individual presents investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid sequences, treated as text, in a vector space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec first on amino acid 3-grams, then on longer BCR sequences, and finally on entire repertoires. Our work demonstrates that Immune2vec is a reliable low-dimensional representation that preserves relevant information in immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec together with machine learning approaches to patient data shows that distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.
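To make the idea of treating receptor sequences as text concrete, the following is a minimal sketch, not the authors' implementation. It assumes overlapping 3-grams with a stride of one and assumes a sequence-level vector is obtained by averaging the 3-gram vectors; the abstract specifies neither choice. The function names `to_ngrams` and `embed_sequence` and the two-dimensional `toy_vectors` table are hypothetical; in practice the vectors would come from a trained word-embedding model.

```python
from typing import Dict, List


def to_ngrams(seq: str, n: int = 3) -> List[str]:
    """Split an amino acid sequence into overlapping n-grams (stride 1).

    Assumption: overlapping windows; other tokenization schemes exist.
    """
    seq = seq.strip().upper()
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def embed_sequence(seq: str, vectors: Dict[str, List[float]], n: int = 3) -> List[float]:
    """Average the vectors of all known n-grams in the sequence.

    Assumption: mean pooling; n-grams missing from `vectors` are skipped.
    """
    grams = [g for g in to_ngrams(seq, n) if g in vectors]
    if not grams:
        raise ValueError("sequence contains no n-grams with a known vector")
    dim = len(vectors[grams[0]])
    return [sum(vectors[g][d] for g in grams) / len(grams) for d in range(dim)]


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {
    "CAR": [1.0, 0.0],
    "ARD": [0.0, 1.0],
    "RDY": [1.0, 1.0],
}

tokens = to_ngrams("CARDY")                      # 3-gram "words" of the sequence
embedding = embed_sequence("CARDY", toy_vectors)  # one vector for the whole sequence
```

Mean pooling is used here only because it is the simplest way to go from n-gram vectors to a fixed-length sequence vector; a repertoire-level representation could be built the same way by pooling over sequences, again as an assumption rather than a description of the published method.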