Yokota Ryo, Kaminaga Yuki, Kobayashi Tetsuya J
Institute of Industrial Science, The University of Tokyo, Tokyo, Japan.
Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan.
Front Immunol. 2017 Nov 15;8:1500. doi: 10.3389/fimmu.2017.01500. eCollection 2017.
Inter-sample comparisons of T-cell receptor (TCR) repertoires are crucial for gaining a better understanding of the immunological states determined by different collections of T cells from different donor sites, cell types, and genetic and pathological backgrounds. For quantitative comparison, most previous studies used conventional methods from ecology, which focus on the TCR sequences shared between pairs of samples. Some recent studies took another approach, Poisson abundance models, which use the abundance distribution of the observed TCR sequences. However, these methods ignore the details of the measured sequences and are consequently unable to identify sub-repertoires that might contribute substantially to the observed inter-sample differences. Moreover, the sparsity of sequence data caused by the huge diversity of repertoires hampers the performance of these methods, especially when few overlapping sequences exist. In this paper, we propose a new approach for REpertoire COmparison in Low Dimensions (RECOLD) based on TCR sequence information, which estimates a low-dimensional structure by embedding the pairwise sequence dissimilarities, defined in the high-dimensional sequence space, into a low-dimensional space. The inter-sample differences between repertoires are then quantified by information-theoretic measures between the data distributions estimated in the embedded space. Using datasets of mouse and human TCR repertoires, we demonstrate that RECOLD can accurately identify the inter-sample hierarchical structures, which correspond well to our intuitive understanding of the sample conditions. Moreover, for the dataset of transgenic mice that have strong restrictions on the diversity of their repertoires, our estimated inter-sample structure was consistent with the structure estimated by previous methods based on abundance or overlapping sequence information.
For the dataset of healthy human donors and Sézary syndrome patients, our method also showed robust estimation performance even when the TCR sequences were highly sparse, whereas previous methods failed to estimate the structure. In addition, we identified the sequences that contribute to the pairwise differences between the repertoires of mice with different genetic backgrounds. Such identification of the sequences contributing to variation in immune cell repertoires may provide substantial insight for the development of new immunotherapies and vaccines.
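The pipeline summarized above (pairwise sequence dissimilarities, a low-dimensional embedding, distributions estimated in the embedded space, and an information-theoretic comparison) can be illustrated with a minimal sketch. The abstract does not specify the dissimilarity, the embedding algorithm, the density estimator, or the divergence, so every choice below is an assumption made for illustration only: Levenshtein distance, a single classical-MDS coordinate found by power iteration, equal-width histograms, and the Jensen-Shannon divergence. All function names are hypothetical, and the input strings are invented toy sequences rather than real CDR3 data.

```python
import math


def levenshtein(a, b):
    """Edit distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def embed_1d(seqs, iters=500):
    """Leading classical-MDS coordinate of the sequences (assumed embedding)."""
    n = len(seqs)
    d2 = [[levenshtein(s, t) ** 2 for t in seqs] for s in seqs]
    row = [sum(r) / n for r in d2]
    tot = sum(row) / n
    # Double-centred Gram matrix B = -1/2 * J * D^2 * J
    b = [[-0.5 * (d2[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iters):  # power iteration towards the dominant eigenvector
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n
        v = [x / norm for x in w]
    # Rayleigh quotient gives the signed eigenvalue; clamp negatives to zero
    lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [math.sqrt(max(lam, 0.0)) * x for x in v]


def histogram(xs, lo, hi, bins):
    """Normalised histogram of xs over equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all points coincide
    for x in xs:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return [c / len(xs) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    def kl(a, m):
        return sum(x * math.log2(x / y) for x, y in zip(a, m) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def compare_repertoires(sample_a, sample_b, bins=4):
    """Embed both samples on one shared axis, then compare their histograms."""
    coords = embed_1d(list(sample_a) + list(sample_b))
    lo, hi = min(coords), max(coords)
    ca, cb = coords[:len(sample_a)], coords[len(sample_a):]
    return js_divergence(histogram(ca, lo, hi, bins),
                         histogram(cb, lo, hi, bins))


# Toy usage with invented strings (not real TCR sequences)
sample_a = ["AAAA", "AAAT"]
sample_b = ["GGGG", "GGGC"]
print(compare_repertoires(sample_a, sample_b))
print(compare_repertoires(sample_a, sample_a))
```

Because the comparison is made on the embedded distributions rather than on exact sequence matches, two samples with no shared sequences still receive a graded dissimilarity. In this toy case the two well-separated samples should score at or near the 1-bit maximum, while a sample compared with itself scores zero. Applying such a pairwise score to every pair of samples would give a dissimilarity matrix from which a hierarchical structure could be built; that step is omitted here.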