Chen Zhongren, Tian Lu, Olshen Richard A
Department of Statistics and Data Science, Yale University, New Haven, CT, USA.
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
J Appl Stat. 2025 Apr 30. doi: 10.1080/02664763.2025.2496724.
This paper is motivated by the need to quantify human immune responses to environmental challenges. Specifically, the genome of the selected cell population from a blood sample is amplified by the PCR process, producing a large number of reads. Each read corresponds to a particular rearrangement of so-called V(D)J sequences. The observed data consist of a set of integers representing the numbers of reads corresponding to different V(D)J sequences. The underlying relative frequencies of distinct V(D)J sequences can be summarized by a probability vector whose dimension is the number of distinct V(D)J rearrangements. The statistical question is how to make inferences on a summary parameter of this probability vector based on a high-dimensional multinomial-type observation. Popular summaries of diversity include clonality and entropy. A point estimator of the clonality based on multiple replicates from the same blood sample has been proposed previously. The remaining challenge is to construct confidence intervals for these parameters that reflect their uncertainty. In this paper, we propose coupling the empirical Bayes method with a resampling-based calibration procedure to construct a robust confidence interval for different population diversity parameters. The method is illustrated via extensive numerical studies and real data examples.
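
As a rough illustration of the two diversity summaries named above, the sketch below computes naive plug-in values from a vector of read counts. It rests on assumptions not stated in the abstract: clonality is taken to be the sum of squared relative frequencies (other definitions are in use), entropy is Shannon entropy with the natural logarithm, and the function names are hypothetical. It is not the replicate-based point estimator or the empirical Bayes interval the paper refers to, and plug-in estimates of this kind can be biased when many rearrangements are rare or unobserved.

```python
import math


def relative_frequencies(read_counts):
    """Convert read counts to observed relative frequencies (zeros dropped)."""
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("read_counts must contain at least one positive count")
    return [c / total for c in read_counts if c > 0]


def plugin_entropy(read_counts):
    """Naive Shannon entropy (natural log) of the observed frequencies."""
    return -sum(p * math.log(p) for p in relative_frequencies(read_counts))


def plugin_clonality(read_counts):
    """Naive clonality, ASSUMED here to be the sum of squared frequencies."""
    return sum(p * p for p in relative_frequencies(read_counts))


# Hypothetical read counts for three distinct V(D)J rearrangements.
counts = [50, 30, 20]
print(plugin_entropy(counts))
print(plugin_clonality(counts))
```

Under this assumed definition, a sample dominated by one rearrangement gives a clonality near 1 and an entropy near 0, while an even spread over many rearrangements gives the opposite.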