Sun Ryan, McCaw Zachary R, Lin Xihong
Department of Biostatistics at MD Anderson Cancer Center.
Senior Machine Learning Scientist at Insitro.
J Am Stat Assoc. 2025;120(550):605-617. doi: 10.1080/01621459.2024.2422124. Epub 2024 Dec 5.
Causal mediation, pleiotropy, and replication analyses are three highly popular genetic study designs. Although these analyses address different scientific questions, the underlying statistical inference problems all involve large-scale testing of composite null hypotheses. The goal is to determine whether all null hypotheses, as opposed to at least one, in a set of individual tests should simultaneously be rejected. Recently, various methods have been proposed for each of these situations, including an appealing two-group empirical Bayes approach that calculates local false discovery rates (lfdr). However, lfdr estimation is difficult because it requires multivariate density estimation. Furthermore, the multiple testing rules for the empirical Bayes lfdr approach can disagree with conventional frequentist z-statistics, which is troubling for a field that ubiquitously utilizes summary statistics. This work proposes a framework to unify two-group testing in genetic association composite null settings, the conditionally symmetric multidimensional Gaussian mixture model (csmGmm). The csmGmm demonstrates more robust operating characteristics than recently proposed alternatives. Crucially, the csmGmm also offers interpretability guarantees by harmonizing lfdr and z-statistic testing rules. We extend the base csmGmm to cover each of the mediation, pleiotropy, and replication settings, and we prove that the agreement between lfdr and z-statistic testing rules holds in each situation. We apply the model to a collection of translational lung cancer genetic association studies that motivated this work.
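To make the composite null idea concrete, here is a minimal sketch of an lfdr calculation for a pair of z-statistics under a toy two-dimensional Gaussian mixture. It is not the csmGmm as specified in the paper: the function names, the effect size `mu`, the mixing proportions in `probs`, and the assumption that the two z-statistics are independent with unit variance are all illustrative choices made here. The only feature borrowed from the abstract is that the non-null component is symmetric about zero.

```python
import math


def normal_pdf(x, mean):
    """Density of a unit-variance normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def symmetric_alt_pdf(x, mu):
    """Hypothetical non-null density: equal mix of N(-mu, 1) and N(+mu, 1)."""
    return 0.5 * normal_pdf(x, -mu) + 0.5 * normal_pdf(x, mu)


def composite_null_lfdr(z1, z2, mu=3.0, probs=(0.90, 0.04, 0.04, 0.02)):
    """Posterior probability that at least one of the two effects is null.

    probs holds assumed (not estimated) proportions for the four
    configurations: neither effect, first only, second only, both.
    The composite null is the union of the first three configurations.
    """
    p_none, p_first, p_second, p_both = probs
    d_none = p_none * normal_pdf(z1, 0.0) * normal_pdf(z2, 0.0)
    d_first = p_first * symmetric_alt_pdf(z1, mu) * normal_pdf(z2, 0.0)
    d_second = p_second * normal_pdf(z1, 0.0) * symmetric_alt_pdf(z2, mu)
    d_both = p_both * symmetric_alt_pdf(z1, mu) * symmetric_alt_pdf(z2, mu)
    null_mass = d_none + d_first + d_second
    return null_mass / (null_mass + d_both)


if __name__ == "__main__":
    for pair in [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (-4.0, 4.0)]:
        print(pair, composite_null_lfdr(*pair))
```

In this toy setting a pair with only one large z-statistic keeps an lfdr close to 1, because a single strong signal is still consistent with the composite null, while a pair with two large z-statistics gets a small lfdr. Because the assumed non-null density is symmetric, flipping the sign of either z-statistic leaves the lfdr unchanged.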