Cai Zhanrui, Lei Jing, Roeder Kathryn
Faculty of Business and Economics, The University of Hong Kong.
Department of Statistics and Data Science, Carnegie Mellon University.
J Am Stat Assoc. 2024;119(547):1794-1804. doi: 10.1080/01621459.2023.2218030. Epub 2023 Dec 21.
Testing independence is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data are high-dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper, we propose a general framework for independence testing: first fit a classifier that distinguishes the joint distribution from the product of the marginal distributions, and then test the significance of the fitted classifier. This framework allows us to borrow strength from the most advanced classification algorithms developed by the modern machine learning community, making it applicable to high-dimensional, complex data. By combining a sample split and a fixed permutation, our test statistic has a universal, fixed Gaussian null distribution that does not depend on the underlying data distribution. Extensive simulations demonstrate the advantages of the newly proposed test compared with existing methods. We further apply the new test to a single-cell data set to test the independence between two types of single-cell sequencing measurements, whose high dimensionality and sparsity make existing methods hard to apply.
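The classify-then-test idea in the abstract can be illustrated with a minimal sketch. It is a simplified variant written for this summary, not the authors' implementation. The names `make_pairs`, `fit_logistic`, and `independence_test` are hypothetical, a toy logistic regression on scalar data stands in for a modern classifier, and the studentized statistic with a lag-1 covariance correction is an assumption that may differ from the statistic defined in the paper.

```python
import math
import random
from statistics import NormalDist


def make_pairs(xs, ys):
    """Build joint pairs (x_i, y_i) and product-like pairs (x_i, y_{i+1})."""
    n = len(xs)
    joint = [(xs[i], ys[i]) for i in range(n)]
    # A fixed cyclic shift plays the role of the fixed permutation.
    shifted = [(xs[i], ys[(i + 1) % n]) for i in range(n)]
    return joint, shifted


def features(x, y):
    # Toy feature map for scalar x and y (assumption); x * y carries dependence.
    return (1.0, x, y, x * y)


def fit_logistic(pos, neg, steps=200, lr=0.1):
    """Gradient-ascent logistic regression: label 1 = joint, 0 = shifted."""
    data = [(features(*p), 1.0) for p in pos] + [(features(*p), 0.0) for p in neg]
    w = [0.0] * 4
    for _ in range(steps):
        grad = [0.0] * 4
        for f, label in data:
            z = sum(wj * fj for wj, fj in zip(w, f))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(4):
                grad[j] += (label - p) * f[j]
        w = [wj + lr * gj / len(data) for wj, gj in zip(w, grad)]
    # Return the fitted linear score (logit) as a function of (x, y).
    return lambda x, y: sum(wj * fj for wj, fj in zip(w, features(x, y)))


def independence_test(xs, ys, seed=0):
    """Return (z, one-sided p-value) for H0: X and Y are independent."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    train, test = idx[: n // 2], idx[n // 2:]

    # Step 1: fit the classifier on the training half only.
    score = fit_logistic(*make_pairs([xs[i] for i in train],
                                     [ys[i] for i in train]))

    # Step 2: on the held-out half, compare scores of joint vs shifted pairs.
    joint, shifted = make_pairs([xs[i] for i in test], [ys[i] for i in test])
    d = [score(*a) - score(*b) for a, b in zip(joint, shifted)]
    m = len(d)
    mean = sum(d) / m
    c = [v - mean for v in d]

    # Neighbouring differences share one y value, so add a lag-1 covariance.
    g0 = sum(v * v for v in c) / m
    g1 = sum(c[i] * c[(i + 1) % m] for i in range(m)) / m
    var = max(g0 + 2.0 * g1, 1e-12)

    z = math.sqrt(m) * mean / math.sqrt(var)
    return z, 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.gauss(0, 1) for _ in range(600)]
    ys = [x + rng.gauss(0, 1) for x in xs]  # dependent by construction
    print(independence_test(xs, ys))
```

Because the classifier is fitted on one half and evaluated on the other, the score function can be treated as fixed when the statistic is computed. This is what makes a normal reference distribution plausible in the sketch, whichever classifier is plugged in.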