Manolopoulou Ioanna, Chan Cliburn, West Mike
Department of Statistical Science, Duke University, Durham, NC
Bayesian Anal. 2010;5(3):1-22.
One of the challenges in using Markov chain Monte Carlo for model analysis in studies with very large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provides little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full dataset is used to guide selection sampling of a further set of observations targeted at a scientifically interesting, low probability region. We define a Sequential Monte Carlo strategy in which the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. An example from flow cytometry illustrates the ability of the approach to increase the resolution of inferences for rare cell subtypes.
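The selection step described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian-kernel weight function, the function names (`simulate_mixture`, `selection_weights`, `targeted_subsample`), and the fixed `center` and `scale` are all hypothetical. In the approach described, the region of interest would be informed by estimates from the initial subsample, and the inference would need to account for the selection mechanism; both steps are omitted here.

```python
import numpy as np


def simulate_mixture(n, rare_frac, rng):
    """Simulate 2-D data from a dominant component plus a rare one.

    The true labels are returned only so the enrichment can be checked.
    """
    is_rare = rng.random(n) < rare_frac
    x = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
    x[is_rare] = rng.normal(loc=4.0, scale=0.5, size=(int(is_rare.sum()), 2))
    return x, is_rare


def selection_weights(x, center, scale):
    """Hypothetical Gaussian-kernel weight in (0, 1], close to 1 near `center`."""
    sq_dist = np.sum((x - center) ** 2, axis=1)
    return np.exp(-0.5 * sq_dist / scale**2)


def targeted_subsample(x, n_initial, center, scale, rng):
    """Draw a random initial subsample, then a weighted targeted subsample.

    Each remaining observation is kept independently with probability
    equal to its selection weight.
    """
    perm = rng.permutation(x.shape[0])
    initial_idx = perm[:n_initial]
    rest_idx = perm[n_initial:]
    w = selection_weights(x[rest_idx], center, scale)
    keep = rng.random(rest_idx.size) < w
    return initial_idx, rest_idx[keep]


rng = np.random.default_rng(0)
x, is_rare = simulate_mixture(200_000, rare_frac=0.002, rng=rng)

# Assumption: the region of interest is taken as known here.
center = np.array([4.0, 4.0])
initial_idx, targeted_idx = targeted_subsample(
    x, n_initial=5_000, center=center, scale=1.0, rng=rng
)

frac_initial = is_rare[initial_idx].mean()
frac_targeted = is_rare[targeted_idx].mean()
print(f"initial subsample:  n={initial_idx.size}, rare fraction={frac_initial:.4f}")
print(f"targeted subsample: n={targeted_idx.size}, rare fraction={frac_targeted:.4f}")
```

In this toy setting the targeted subsample is much smaller than the full dataset yet contains a far higher proportion of rare-component observations than the random subsample. Because the targeted observations are not a random sample, fitting the mixture to them without correcting for the weights would bias the estimates.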