Confounder Adjustment in Multiple Hypothesis Testing

Authors

Wang Jingshu, Zhao Qingyuan, Hastie Trevor, Owen Art B

Affiliations

Department of Statistics, The Wharton School, University of Pennsylvania, 400 Huntsman Hall, 3730 Walnut St, Philadelphia, Pennsylvania 19104, USA.

Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA.

Publication

Ann Stat. 2017 Oct;45(5):1863-1894. doi: 10.1214/16-AOS1511. Epub 2017 Oct 31.

Abstract

We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both the primary variable(s) of interest (e.g., treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 [Gagnon-Bartsch, Jacob and Speed (2013)] and LEAPP [Ann. Appl. Stat. 6 (2012) 1664-1688], which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true nonnulls to be sparse. Two different estimators, based on RUV-4 and LEAPP respectively, are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator that observes the latent confounding factors. For hypothesis testing, we show that the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
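
The framework referred to in the abstract can be summarized by a linear model with latent factors. The display below is only a schematic in our own notation, not a transcription of the paper's equations:

```latex
% Schematic of the confounded multiple-testing model (our notation):
%   Y is the n x p outcome matrix, X holds the primary variable(s),
%   Z holds the r latent confounders (correlated with X), E is noise.
Y = X \beta^{\top} + Z \Gamma^{\top} + E
% Without further assumptions, \beta is not identified, because part of
% the effect of X can be absorbed into Z. The two identification
% conditions discussed in the abstract are, roughly:
%   (i)  negative controls: \beta_j = 0 for all j in a known set
%        \mathcal{C} (the RUV-4 scenario);
%   (ii) sparsity: \#\{\, j : \beta_j \neq 0 \,\} is small relative to p
%        (the LEAPP scenario).
```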
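To make the negative-control route concrete, here is a minimal, self-contained Python sketch of an RUV-4-style adjustment followed by the Benjamini-Hochberg procedure: latent factor scores are estimated by an SVD of the negative-control columns (which by assumption carry only confounding signal plus noise) and then included as covariates when testing each feature. The simulation setup, variable names, and the assumption that the number of factors r is known are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, r = 200, 1000, 3                        # samples, features, latent factors
X = rng.normal(size=(n, 1))                   # primary variable of interest
Z = X @ rng.normal(size=(1, r)) + rng.normal(size=(n, r))  # confounders correlated with X
beta = np.zeros(p)
beta[:50] = 1.0                               # sparse true effects: features 0..49 are nonnull
Gamma = rng.normal(size=(p, r))               # confounder loadings
Y = X @ beta[None, :] + Z @ Gamma.T + rng.normal(size=(n, p))

controls = np.arange(900, 1000)               # features known a priori to be null

# Step 1: estimate latent factor scores from the negative-control columns
# alone; their top-r left singular subspace approximates the span of Z.
U, s, _ = np.linalg.svd(Y[:, controls], full_matrices=False)
Z_hat = U[:, :r] * s[:r]                      # n x r estimated confounder scores

# Step 2: regress every feature on X together with the estimated factors
# and form a t-statistic for the coefficient on X.
D = np.column_stack([X, Z_hat])               # n x (1 + r) design matrix
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)  # (1 + r) x p coefficients
resid = Y - D @ coef
dof = n - D.shape[1]
sigma2 = (resid ** 2).sum(axis=0) / dof
se = np.sqrt(sigma2 * np.linalg.inv(D.T @ D)[0, 0])
tstat = coef[0] / se
pvals = 2 * stats.t.sf(np.abs(tstat), dof)

# Step 3: Benjamini-Hochberg at nominal FDR level q.
def benjamini_hochberg(pv, q=0.1):
    m = len(pv)
    order = np.argsort(pv)
    passed = pv[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

rej = benjamini_hochberg(pvals)
print(f"discoveries: {rej.sum()}  false discoveries: {rej[50:].sum()}")
```

In the sparsity-based (LEAPP-style) scenario there is no known control set, so the factor-estimation step would instead operate on all columns and rely on a robust regression to separate the sparse primary effects from the confounder loadings.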

