Baggerly K A
Department of Biostatistics, M. D. Anderson Cancer Center, Houston, Texas 77030-4009, USA.
Cytometry. 2001 Oct 1;45(2):141-50. doi: 10.1002/1097-0320(20011001)45:2<141::aid-cyto1156>3.0.co;2-m.
A key problem in immunohistochemistry is assessing when two sample histograms are significantly different. One test that is commonly used for this purpose in the univariate case is the chi-squared test. Comparing multivariate distributions is qualitatively harder, as the "curse of dimensionality" means that the number of bins can grow exponentially. For the chi-squared test to be useful, data-dependent binning methods must be employed. An example of how this can be done is provided by the "probability binning" method of Roederer et al. (1,2,3).
We derive the theoretical distribution of the probability binning statistic, giving it a more rigorous foundation. We show that its null distribution is a scaled chi-squared distribution, and we relate the statistic to the standard chi-squared statistic.
A small simulation shows how the theoretical results can be used to (a) modify the probability binning statistic to make it more sensitive and (b) suggest variant statistics which, while still exploiting the data-dependent strengths of the probability binning procedure, may be easier to work with.
The probability binning procedure effectively uses adaptive binning to locate structure in high-dimensional data. The derivation of a theoretical basis provides a more detailed interpretation of its behavior and renders the probability binning method more flexible.
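As a rough illustration of the adaptive binning idea, the sketch below builds bins from a control sample and compares a test sample against them. Several details are assumptions rather than statements from this abstract: the splitting rule (recursive median cuts along the highest-variance dimension), the form of the probability binning statistic (a sum of (p_c - p_s)^2 / (p_c + p_s) over bins, with p_c and p_s the bin proportions), and all function names, which are hypothetical. The standard two-sample chi-squared statistic is computed on the same bins, so that the scaling between the two can be inspected for equal sample sizes.

```python
import numpy as np

def build_bins(control, depth):
    """Recursively halve the control sample (assumed splitting rule).

    Each split cuts at the median of the dimension with the largest
    variance. Assumes continuous data and at least 2**depth points,
    so that no bin becomes empty.
    """
    if depth == 0:
        return None  # leaf: one bin
    dim = int(np.argmax(control.var(axis=0)))
    cut = float(np.median(control[:, dim]))
    low = control[control[:, dim] <= cut]
    high = control[control[:, dim] > cut]
    return (dim, cut, build_bins(low, depth - 1), build_bins(high, depth - 1))

def assign_bins(tree, points):
    """Return the bin label (0 .. 2**depth - 1) of every point."""
    labels = np.zeros(len(points), dtype=int)

    def walk(node, idx, label):
        if node is None:
            labels[idx] = label
            return
        dim, cut, low, high = node
        go_low = points[idx, dim] <= cut
        walk(low, idx[go_low], 2 * label)
        walk(high, idx[~go_low], 2 * label + 1)

    walk(tree, np.arange(len(points)), 0)
    return labels

def pb_statistics(control, test, depth=3):
    """Return (assumed PB statistic, standard two-sample chi-squared)."""
    tree = build_bins(control, depth)
    n_bins = 2 ** depth
    c = np.bincount(assign_bins(tree, control), minlength=n_bins)
    s = np.bincount(assign_bins(tree, test), minlength=n_bins)
    nc, ns = c.sum(), s.sum()
    keep = (c + s) > 0  # skip bins that are empty in both samples
    pc, ps = c[keep] / nc, s[keep] / ns
    # Assumed form of the probability binning statistic (bin proportions).
    pb = np.sum((pc - ps) ** 2 / (pc + ps))
    # Standard two-sample chi-squared statistic on raw counts.
    chi2 = np.sum(
        (np.sqrt(ns / nc) * c[keep] - np.sqrt(nc / ns) * s[keep]) ** 2
        / (c[keep] + s[keep])
    )
    return pb, chi2

# Usage: two samples drawn from the same two-dimensional distribution.
rng = np.random.default_rng(0)
control = rng.normal(size=(400, 2))
test = rng.normal(size=(400, 2))
pb, chi2 = pb_statistics(control, test, depth=3)
print(pb, chi2)
```

With equal sample sizes n, the two quantities in this sketch differ only by the factor n, which is one simple way to see how a proportion-based statistic can be a scaled version of the count-based one. Because the bins are chosen from the control data, the usual reference distribution for fixed bins should not be assumed to carry over without checking the paper's derivation.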