Priors, population sizes, and power in genome-wide hypothesis tests.

Author affiliations

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA.

Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21218, USA.

Publication information

BMC Bioinformatics. 2023 Apr 26;24(1):170. doi: 10.1186/s12859-023-05261-9.

DOI:10.1186/s12859-023-05261-9

PMID:37101120

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10134629/

Abstract

BACKGROUND

Genome-wide tests, including genome-wide association studies (GWAS) of germ-line genetic variants, driver tests of cancer somatic mutations, and transcriptome-wide association tests of RNAseq data, carry a high multiple testing burden. This burden can be overcome by enrolling larger cohorts or alleviated by using prior biological knowledge to favor some hypotheses over others. Here we compare these two methods in terms of their abilities to boost the power of hypothesis testing.

RESULTS

We provide a quantitative estimate for progress in cohort sizes and present a theoretical analysis of the power of oracular hard priors: priors that select a subset of hypotheses for testing, with an oracular guarantee that all true positives are within the tested subset. This theory demonstrates that for GWAS, strong priors that limit testing to 100-1000 genes provide less power than typical annual 20-40% increases in cohort sizes. Furthermore, non-oracular priors that exclude even a small fraction of true positives from the tested set can perform worse than not using a prior at all.

CONCLUSION

Our results provide a theoretical explanation for the continued dominance of simple, unbiased univariate hypothesis tests for GWAS: if a statistical question can be answered by larger cohort sizes, it should be answered by larger cohort sizes rather than by more complicated biased methods involving priors. We suggest that priors are better suited for non-statistical aspects of biology, such as pathway structure and causality, that are not yet easily captured by standard hypothesis tests.
