Roederer M, Treister A, Moore W, Herzenberg L A
Vaccine Research Center, NIH, Bethesda, Maryland 20892-3015, USA.
Cytometry. 2001 Sep 1;45(1):37-46. doi: 10.1002/1097-0320(20010901)45:1<37::aid-cyto1142>3.0.co;2-e.
Comparing distributions of data is an important goal in many applications. For example, determining whether two samples (e.g., a control and a test sample) differ with statistical significance is useful for detecting a response, or for providing feedback on instrument stability by detecting when collected data vary significantly over time.
We apply a variant of the chi-squared statistic to the comparison of univariate distributions. In this variant, a control distribution is divided such that an equal number of events falls into each of the divisions, or bins. This approach is thereby a minimax algorithm, in that it minimizes the maximum expected variance for the control distribution. The control-derived bins are then applied to test sample distributions, and a normalized chi-squared value is computed. We term this algorithm Probability Binning.
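The binning step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`equal_count_edges`, `bin_fractions`, `normalized_chi2`) are hypothetical, and the normalization shown, a sum over bins of (c - s)^2 / (c + s) computed on the per-bin event fractions c (control) and s (test), is assumed here because the abstract does not state the formula.

```python
import bisect
import random


def equal_count_edges(control, n_bins):
    """Interior bin edges placed so that each bin holds about the same
    number of control events (equal-probability bins)."""
    ordered = sorted(control)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def bin_fractions(sample, edges):
    """Fraction of the sample's events that fall into each bin."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect.bisect_right(edges, x)] += 1
    return [c / len(sample) for c in counts]


def normalized_chi2(control, test, n_bins=20):
    """Assumed normalization: chi-squared computed on bin fractions,
    using bins derived from the control sample only."""
    edges = equal_count_edges(control, n_bins)
    c = bin_fractions(control, edges)
    s = bin_fractions(test, edges)
    return sum((ci - si) ** 2 / (ci + si) for ci, si in zip(c, s) if ci + si > 0)


# Synthetic data for illustration only (not from the paper).
rng = random.Random(0)
control = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(0.5, 1.0) for _ in range(2000)]

chi2_same = normalized_chi2(control, same)
chi2_shifted = normalized_chi2(control, shifted)
print(chi2_same, chi2_shifted)
```

Because the bin edges are taken from the control's own order statistics, every bin carries the same control fraction, so no single bin dominates the statistic through a small denominator.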
Using a Monte Carlo simulation, we determined the distribution of chi-squared values obtained by comparing sets of events derived from the same distribution. Based on this distribution, we derive a conversion of any given chi-squared value into a metric that is analogous to a t-score, i.e., it can be used to estimate the probability that a test distribution is different from a control distribution. We demonstrate that this metric scales with the difference between two distributions, and can be used to rank samples according to similarity to a control. Finally, we apply the metric to ranking immunophenotyping distributions, which suggests that it can indeed be used to determine objectively the relative distance of distributions from a single control.
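The Monte Carlo step can be illustrated with a short Python sketch. It is a hedged approximation rather than the paper's conversion: the null distribution is estimated empirically by repeatedly comparing two samples drawn from the same distribution, and a chi-squared value is then standardized against the simulated mean and standard deviation. The helper names (`pb_chi2`, `null_chi2_values`, `standardized_score`), the fraction-based chi-squared, and the Gaussian test data are all assumptions made for illustration.

```python
import bisect
import random
import statistics


def pb_chi2(control, test, n_bins):
    """Chi-squared on bin fractions, with equal-count bins from the control
    (assumed form; the abstract does not give the formula)."""
    ordered = sorted(control)
    n = len(ordered)
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

    def fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[bisect.bisect_right(edges, x)] += 1
        return [c / len(sample) for c in counts]

    c, s = fractions(control), fractions(test)
    return sum((a - b) ** 2 / (a + b) for a, b in zip(c, s) if a + b > 0)


def null_chi2_values(draw, n_events, n_bins, n_trials, rng):
    """Chi-squared values for pairs of samples drawn from the SAME distribution."""
    values = []
    for _ in range(n_trials):
        a = [draw(rng) for _ in range(n_events)]
        b = [draw(rng) for _ in range(n_events)]
        values.append(pb_chi2(a, b, n_bins))
    return values


def standardized_score(chi2, null_values):
    """Distance of chi2 from the simulated null mean, in null standard deviations."""
    return (chi2 - statistics.mean(null_values)) / statistics.stdev(null_values)


# Synthetic demonstration: larger shifts should produce larger scores.
rng = random.Random(1)
n_events, n_bins = 500, 10
null = null_chi2_values(lambda r: r.gauss(0.0, 1.0), n_events, n_bins, 200, rng)
control = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
scores = {}
for shift in (0.25, 0.5, 1.0):
    test = [rng.gauss(shift, 1.0) for _ in range(n_events)]
    scores[shift] = standardized_score(pb_chi2(control, test, n_bins), null)
print(scores)
```

Sorting the test samples by this score gives a ranking by dissimilarity to the control, which is the use the abstract describes. The simulated null depends on the number of events and bins, so it needs to be regenerated when either changes.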
Probability Binning, as shown here, provides a useful metric for determining the probability that two or more flow cytometric data distributions are different. This metric can also be used to rank distributions to identify which are most similar or dissimilar. In addition, the algorithm can be used to quantitate contamination even of highly overlapping populations. Finally, as demonstrated in an accompanying paper, Probability Binning can be used to gate on events that represent significantly different subsets from a control sample. Published 2001 Wiley-Liss, Inc.