Gorsky S, Ma L
Department of Mathematics and Statistics, University of Massachusetts Amherst, 710 N. Pleasant Street, Amherst, Massachusetts 01003, U.S.A.
Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, U.S.A.
Biometrika. 2022 Sep;109(3):569-587. doi: 10.1093/biomet/asac013. Epub 2022 Feb 21.
Identifying dependency in multivariate data is a common inference task that arises in numerous applications. However, existing nonparametric independence tests typically require computation that scales at least quadratically with the sample size, making it difficult to apply them in the presence of massive sample sizes. Moreover, resampling is usually necessary to evaluate the statistical significance of the resulting test statistics at finite sample sizes, further worsening the computational burden. We introduce a scalable, resampling-free approach to testing the independence between two random vectors by breaking down the task into simple univariate tests of independence on a collection of 2 × 2 contingency tables constructed through sequential coarse-to-fine discretization of the sample space, transforming the inference task into a multiple testing problem that can be completed with almost linear complexity with respect to the sample size. To address increasing dimensionality, we introduce a coarse-to-fine sequential adaptive procedure that exploits the spatial features of dependency structures. We derive a finite-sample theory that guarantees the inferential validity of our adaptive procedure at any given sample size. We show that our approach can achieve strong control of the level of the testing procedure at any sample size without resampling or asymptotic approximation and establish its large-sample consistency. We demonstrate through an extensive simulation study its substantial computational advantage in comparison to existing approaches while achieving robust statistical power under various dependency scenarios, and illustrate how its divide-and-conquer nature can be exploited to not just test independence, but to learn the nature of the underlying dependency. Finally, we demonstrate the use of our method through analysing a dataset from a flow cytometry experiment.
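The divide-and-conquer idea described in the abstract can be sketched for the simplest case of two scalar variables. This is a simplified illustration, not the authors' implementation. The function name `coarse_to_fine_pvalue`, the rank transform of each margin, the `max_depth` parameter, the choice of Fisher's exact test for each 2 × 2 table, and the plain Bonferroni correction are all assumptions made here. The paper's adaptive selection of tables and its finite-sample guarantees are not reproduced.

```python
import numpy as np
from scipy.stats import fisher_exact, rankdata


def coarse_to_fine_pvalue(x, y, max_depth=2):
    """Bonferroni-adjusted p-value from Fisher exact tests on nested 2x2 tables.

    Hypothetical helper for illustration only (assumed design, see lead-in).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Map each margin into (0, 1) by ranks so that dyadic cuts are well defined.
    u = (rankdata(x) - 0.5) / n
    v = (rankdata(y) - 0.5) / n

    # Fixed number of candidate tables: one per dyadic block at every
    # pair of resolutions (dx, dy) with 0 <= dx, dy <= max_depth.
    n_tables = (2 ** (max_depth + 1) - 1) ** 2

    min_p = 1.0
    for dx in range(max_depth + 1):
        for dy in range(max_depth + 1):
            wx, wy = 1.0 / 2 ** dx, 1.0 / 2 ** dy  # block widths
            for i in range(2 ** dx):
                for j in range(2 ** dy):
                    x_lo, y_lo = i * wx, j * wy
                    inside = (
                        (u >= x_lo) & (u < x_lo + wx)
                        & (v >= y_lo) & (v < y_lo + wy)
                    )
                    if not inside.any():
                        continue  # empty block: treated as p = 1
                    # Split the block at its midpoints to form a 2x2 table.
                    left = u[inside] < x_lo + wx / 2
                    low = v[inside] < y_lo + wy / 2
                    table = [
                        [int(np.sum(left & low)), int(np.sum(left & ~low))],
                        [int(np.sum(~left & low)), int(np.sum(~left & ~low))],
                    ]
                    _, p = fisher_exact(table)
                    min_p = min(min_p, p)

    # Bonferroni over the fixed count of candidate tables.
    return min(1.0, n_tables * min_p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y_dep = x + 0.1 * rng.normal(size=500)   # strongly dependent on x
    y_ind = rng.normal(size=500)             # generated independently of x
    print("dependent:  ", coarse_to_fine_pvalue(x, y_dep))
    print("independent:", coarse_to_fine_pvalue(x, y_ind))
```

Each table is tested on its own, so the work splits naturally across blocks and resolutions. This sketch rescans the full sample for every block for readability; an efficient version would pass down only the points that fall in each block.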