The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel.
Phys Biol. 2013 Oct;10(5):056001. doi: 10.1088/1478-3975/10/5/056001. Epub 2013 Aug 22.
Complementarity-determining region 3 (CDR3) is the most hypervariable region in B cell receptor (BCR) and T cell receptor (TCR) genes, and the most critical structure in antigen recognition; it therefore helps determine the fates of developing and responding lymphocytes. There are millions of different TCR Vβ chain or BCR heavy chain CDR3 sequences in human blood. Even now that high-throughput sequencing is widely used, CDR3 length distributions (also called spectratypes) remain a much quicker and cheaper way to assess repertoire diversity. However, the complexity of the distributions and the large amount of information per sample (e.g. 32 distributions of the TCRα chain, and 24 of TCRβ) call for machine learning tools to explore them fully. We examined the ability of supervised machine learning, which uses computational models to find hidden patterns in predefined biological groups, to analyze CDR3 length distributions from various sources and to distinguish between experimental groups. We found that (a) splenic BCR CDR3 length distributions are characterized by low standard deviations and few local maxima compared to peripheral blood distributions; (b) healthy elderly people's BCR CDR3 length distributions can be distinguished from those of the young; and (c) a machine learning model based on TCR CDR3 distribution features can detect myelodysplastic syndrome with approximately 93% accuracy. Overall, we demonstrate that supervised machine learning methods can contribute to our understanding of lymphocyte repertoire diversity.
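To make the distribution features named in finding (a) concrete, the following minimal Python sketch computes the standard deviation and the number of local maxima of a single CDR3 length distribution. It is an illustration under stated assumptions, not the authors' pipeline: the function name `spectratype_features`, the input layout (one count per consecutive CDR3 length), and the rule that only strict peaks count as local maxima are all hypothetical choices, since the abstract does not define how these features were computed.

```python
from math import sqrt


def spectratype_features(counts, min_length=1):
    """Summarize one CDR3 length distribution (spectratype).

    counts[i] is the signal (e.g. read count or peak area) observed for
    CDR3 length min_length + i. This layout is an assumption made for
    illustration; the source does not specify a data format.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("distribution is empty")

    # Normalize to relative frequencies so samples of different depth compare.
    freqs = [c / total for c in counts]
    lengths = [min_length + i for i in range(len(counts))]

    # Frequency-weighted mean and standard deviation of CDR3 length.
    mean = sum(l * f for l, f in zip(lengths, freqs))
    variance = sum(f * (l - mean) ** 2 for l, f in zip(lengths, freqs))

    # Count strict local maxima; the ends are padded with zeros so a peak
    # at the first or last length is counted. Flat plateaus are NOT counted,
    # which is a simplification of this sketch.
    padded = [0.0] + freqs + [0.0]
    n_maxima = sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

    return {"mean": mean, "std": sqrt(variance), "n_local_maxima": n_maxima}


if __name__ == "__main__":
    # Made-up example values, not data from the study.
    print(spectratype_features([1, 4, 10, 4, 1], min_length=8))
    print(spectratype_features([1, 5, 1, 5, 1], min_length=8))
```

Feature vectors of this kind, computed for each distribution in a sample, could then be passed to any supervised classifier together with the group labels; which classifier and which additional features the authors used is not stated in the abstract.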