Asymptotic Distribution-Free Independence Test for High Dimension Data.

Authors

Cai Zhanrui, Lei Jing, Roeder Kathryn

Affiliations

Faculty of Business and Economics, The University of Hong Kong.

Department of Statistics and Data Science, Carnegie Mellon University.

Publication information

J Am Stat Assoc. 2024;119(547):1794-1804. doi: 10.1080/01621459.2023.2218030. Epub 2023 Dec 21.

DOI: 10.1080/01621459.2023.2218030

PMID: 39651450

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11620790/

Abstract

Test of independence is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data is high dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper, we propose a general framework for independence testing by first fitting a classifier that distinguishes the joint and product distributions, and then testing the significance of the fitted classifier. This framework allows us to borrow the strength of the most advanced classification algorithms developed from the modern machine learning community, making it applicable to high dimensional, complex data. By combining a sample split and a fixed permutation, our test statistic has a universal, fixed Gaussian null distribution that is independent of the underlying data distribution. Extensive simulations demonstrate the advantages of the newly proposed test compared with existing methods. We further apply the new test to a single cell data set to test the independence between two types of single cell sequencing measurements, whose high dimensionality and sparsity make existing methods hard to apply.
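The recipe in the abstract (fit a classifier on one half of the sample, then test its significance on the other half against a fixed permutation) can be illustrated with a minimal sketch. This is not the authors' statistic: the scalar data, the toy score `w * x * y`, the cyclic shift used as the fixed permutation, the lag-1 variance correction, and the names `fit_score` and `independence_test` are all assumptions made for illustration.

```python
import math
import random


def fit_score(train):
    """Toy stand-in for a classifier: score(x, y) = w * x * y, where w is the
    gap in mean x*y between true pairs (joint) and shifted pairs (product)."""
    m = len(train)
    joint = sum(x * y for x, y in train) / m
    shifted = sum(train[i][0] * train[(i + 1) % m][1] for i in range(m)) / m
    w = joint - shifted
    return lambda x, y: w * x * y


def independence_test(pairs):
    """Return (z statistic, one-sided p-value) from a sample-split test."""
    half = len(pairs) // 2
    train, test = pairs[:half], pairs[half:]
    score = fit_score(train)

    # Fixed cyclic permutation on the held-out half: pair x_i with y_{i+1}.
    m = len(test)
    d = [score(test[i][0], test[i][1])
         - score(test[i][0], test[(i + 1) % m][1])
         for i in range(m)]
    mean = sum(d) / m
    c = [v - mean for v in d]
    var0 = sum(v * v for v in c) / m
    # Neighbouring differences share one y value, so add the lag-1 covariance.
    cov1 = sum(c[i] * c[i + 1] for i in range(m - 1)) / m
    long_run_var = max(var0 + 2.0 * cov1, 1e-12)
    z = math.sqrt(m) * mean / math.sqrt(long_run_var)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail standard normal
    return z, p


# Simulated data (assumed for the demo): one dependent and one independent pair.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(400)]
dependent = [(x, x + 0.1 * rng.gauss(0, 1)) for x in xs]
independent = [(x, rng.gauss(0, 1)) for x in xs]
z_dep, p_dep = independence_test(dependent)
z_ind, p_ind = independence_test(independent)
```

Because the score is fitted on one half and evaluated on the other, the held-out differences have mean zero under independence whatever classifier is plugged in, which is why the sketch compares the statistic with a standard normal. In a high dimensional setting the toy score would be replaced by a real classifier, such as a neural network or random forest.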
