Department of Biostatistics and Data Science, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA.
Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA.
Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad230.
The T-cell receptor (TCR) repertoire is highly diverse across the population and plays an essential role in initiating multiple immune processes. TCR sequencing (TCR-seq) has been developed to profile the T-cell repertoire. As in other high-throughput experiments, contamination can occur at several steps of TCR-seq, including sample collection, preparation and sequencing. Such contamination creates artifacts in the data, leading to inaccurate or even biased results. Most existing methods assume 'clean' TCR-seq data as the starting point and cannot handle data contamination. Here, we develop a novel statistical model to systematically detect and remove contamination in TCR-seq data. We group the observed contamination into two sources, pairwise and cross-cohort. For both sources, we provide visualizations and summary statistics to help users assess the severity of the contamination. Incorporating prior information from 14 existing TCR-seq datasets with minimal contamination, we develop a straightforward Bayesian model to statistically identify contaminated samples. We further provide strategies for removing the affected sequences so that downstream analysis can proceed without repeating experiments. In simulation studies, our proposed model detects contamination more robustly than a few off-the-shelf detection methods. We illustrate the method on two TCR-seq datasets generated locally.
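To make the idea of a pairwise summary statistic concrete, the sketch below computes, for every pair of samples, the fraction of the smaller repertoire's distinct clonotypes that also appear in the other sample. This is only an illustrative assumption about what such a statistic might look like. It is not the Bayesian model described above, and the function name `pairwise_sharing`, the sample labels and the toy CDR3 sequences are all hypothetical.

```python
from itertools import combinations


def pairwise_sharing(samples):
    """Map each sample pair to its clonotype sharing fraction.

    `samples` maps a sample label to an iterable of clonotype identifiers
    (for example, CDR3 amino-acid sequences). The sharing fraction is the
    number of distinct clonotypes present in both samples, divided by the
    number of distinct clonotypes in the smaller of the two samples.
    This is a hypothetical statistic, not the model from the paper.
    """
    result = {}
    for a, b in combinations(sorted(samples), 2):
        set_a, set_b = set(samples[a]), set(samples[b])
        smaller = min(len(set_a), len(set_b))
        shared = len(set_a & set_b)
        # Guard against empty repertoires to avoid division by zero.
        result[(a, b)] = shared / smaller if smaller else 0.0
    return result


# Toy sequences invented for illustration only.
toy = {
    "S1": ["CASSLGQETQYF", "CASSPGTGELFF", "CASSIRSSYEQYF"],
    "S2": ["CASSLGQETQYF", "CASSPGTGELFF", "CASRDRGNTEAFF"],
    "S3": ["CSARDLNTGELFF"],
}
rates = pairwise_sharing(toy)
for pair, rate in rates.items():
    print(pair, round(rate, 3))
```

Because the repertoire is highly diverse across individuals, an unusually high sharing fraction between two unrelated samples would be one plausible signal worth inspecting. Any threshold for flagging a pair would have to come from reference data, which this sketch does not attempt to model.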