Aw Alan J, Spence Jeffrey P, Song Yun S
Department of Statistics, University of California, Berkeley.
Department of Genetics, School of Medicine, Stanford University.
Ann Appl Stat. 2024 Mar;18(1):858-881. doi: 10.1214/23-aoas1817. Epub 2024 Jan 31.
In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or can the features be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given the dependency structure of the features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).
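To make the idea of a non-parametric exchangeability test concrete, the following is a minimal sketch of a permutation test on a binary sample-by-feature matrix. It is an illustration under stated assumptions, not the flintyR or flintyPy API: the function names are hypothetical, the statistic (variance of pairwise Hamming distances between samples) is an assumption chosen for illustration, and the features are assumed to be mutually independent so that each column can be permuted separately to build the null distribution. It also uses brute-force resampling rather than the large-sample asymptotics mentioned in the abstract.

```python
import random
from itertools import combinations
from statistics import pvariance


def pairwise_distance_variance(rows):
    """Variance of Hamming distances over all pairs of samples (rows)."""
    dists = [sum(a != b for a, b in zip(r, s)) for r, s in combinations(rows, 2)]
    return pvariance(dists)


def permutation_exchangeability_test(rows, n_perm=200, seed=0):
    """Permutation p-value for sample exchangeability (illustrative sketch).

    Assumes independent features: each column is shuffled on its own,
    which preserves per-feature frequencies but breaks any sample structure.
    """
    rng = random.Random(seed)
    observed = pairwise_distance_variance(rows)
    columns = [list(col) for col in zip(*rows)]
    exceed = 0
    for _ in range(n_perm):
        # rng.sample with k == len(col) returns a random permutation of col
        shuffled = [rng.sample(col, len(col)) for col in columns]
        permuted_rows = list(zip(*shuffled))
        if pairwise_distance_variance(permuted_rows) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive
    p_value = (1 + exceed) / (1 + n_perm)
    return observed, p_value


# Hypothetical data: two subpopulations with very different allele frequencies
data_rng = random.Random(1)
pop_a = [[int(data_rng.random() < 0.1) for _ in range(50)] for _ in range(10)]
pop_b = [[int(data_rng.random() < 0.9) for _ in range(50)] for _ in range(10)]
stat, p = permutation_exchangeability_test(pop_a + pop_b)
print(stat, p)
```

A small p-value here suggests that the pooled sample is not exchangeable, as would be expected when two strongly differentiated groups are combined. When features are dependent, the permutation scheme would need to shuffle whole blocks of features together rather than single columns.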