Chen Zhongren, Tian Lu, Olshen Richard A
Department of Statistics and Data Science, Yale University, New Haven, CT, USA.
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
J Appl Stat. 2025 Apr 30. doi: 10.1080/02664763.2025.2496724.
This paper is motivated by the need to quantify human immune responses to environmental challenges. Specifically, the genome of a selected cell population from a blood sample is amplified by PCR, producing a large number of reads. Each read corresponds to a particular rearrangement of so-called V(D)J sequences. The observed data consist of a set of integers, the numbers of reads corresponding to different V(D)J sequences. The underlying relative frequencies of distinct V(D)J sequences can be summarized by a probability vector whose dimension is the number of distinct V(D)J rearrangements. The statistical question is how to make inferences about a summary parameter of this probability vector based on a high-dimensional multinomial-type observation. Popular summaries of diversity include clonality and entropy. A point estimator of clonality based on multiple replicates from the same blood sample has been proposed previously. The remaining challenge is therefore to construct confidence intervals for these parameters that reflect their uncertainty. In this paper, we propose coupling the empirical Bayes method with a resampling-based calibration procedure to construct robust confidence intervals for different population diversity parameters. The method is illustrated via extensive numerical studies and real data examples.
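As a rough illustration of the two diversity summaries named in the abstract, the sketch below computes naive plug-in values from a vector of read counts. It assumes clonality is taken as the sum of squared relative frequencies and entropy as the Shannon entropy in nats; the abstract does not spell out either definition, so both are assumptions. The function names and the example counts are hypothetical, and this is not the authors' empirical Bayes or resampling procedure.

```python
import math


def relative_frequencies(counts):
    """Convert read counts to observed relative frequencies (zeros dropped)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive entry")
    return [c / total for c in counts if c > 0]


def plugin_clonality(counts):
    """Plug-in clonality, assumed here to be the sum of squared frequencies."""
    return sum(p * p for p in relative_frequencies(counts))


def plugin_entropy(counts):
    """Plug-in Shannon entropy in nats (natural logarithm)."""
    return -sum(p * math.log(p) for p in relative_frequencies(counts))


# Hypothetical read counts for four distinct V(D)J rearrangements.
example_counts = [50, 30, 15, 5]
clonality = plugin_clonality(example_counts)
entropy = plugin_entropy(example_counts)
```

Plug-in values of this kind are only a starting point: when the number of distinct rearrangements is large relative to the number of reads, they can be noticeably biased, and they carry no measure of uncertainty, which is the gap the confidence intervals described above are meant to fill.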