Department of Electrical Engineering, Stanford University, Palo Alto, 94304, USA.
Department of Biomedical Data Science, Stanford University, Palo Alto, 94304, USA.
Nat Commun. 2019 Jul 31;10(1):3433. doi: 10.1038/s41467-019-11247-0.
Multiple hypothesis testing is an essential component of modern data science. In many settings, covariates are available for each hypothesis in addition to its p-value, e.g., functional annotation of variants in genome-wide association studies. Popular multiple testing approaches such as the Benjamini-Hochberg procedure (BH) ignore this information. Here we introduce AdaFDR, a fast and flexible method that adaptively learns the optimal p-value threshold from covariates to significantly improve detection power. On eQTL analysis of the GTEx data, AdaFDR discovers 32% more associations than BH at the same false discovery rate. We prove that AdaFDR controls false discovery proportion and show that it makes substantially more discoveries while controlling false discovery rate (FDR) in extensive experiments. AdaFDR is computationally efficient and allows multi-dimensional covariates with both numeric and categorical values, making it broadly useful across many applications.
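For orientation, the BH baseline that the abstract compares against can be sketched in a few lines. This is a minimal illustration of the standard BH step-up rule, not AdaFDR itself. The function name `benjamini_hochberg`, the default `alpha`, and the example p-values are assumptions made for this sketch. Note that the rule uses only the p-values: every hypothesis is compared against the same rank-based threshold, which is the covariate-blind behavior the abstract describes.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Return a list of booleans marking the hypotheses rejected by BH.

    Standard step-up rule: sort the p-values, find the largest rank k with
    p_(k) <= alpha * k / n, and reject the k smallest p-values.
    """
    n = len(p_values)
    if n == 0:
        return []

    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])

    # Largest rank whose p-value falls under its BH threshold (0 if none).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * n
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, chosen only to exercise the function.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, alpha=0.05)
```

A covariate-adaptive method such as the one described above replaces the single shared threshold with one that varies with each hypothesis's covariates; the sketch does not attempt that step.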