Chu Benjamin B, He Zihuai, Sabatti Chiara
bioRxiv. 2025 Jul 9:2025.06.05.658138. doi: 10.1101/2025.06.05.658138.
The standard analysis pipeline for genome-wide association studies (GWAS) is based on marginal tests of association. These are computationally convenient and portable, but the discoveries resulting from their rejections are not immediately interpretable and require post-processing such as "clumping" and "fine mapping." An interesting alternative is provided by conditional independence hypotheses: their rejections lead to the identification of distinct signals across the genome, accounting for measured confounders and pointing to separate causal pathways. An obstacle to the wide adoption of this approach has been that it requires access to individual-level data. Overcoming this barrier, recent work has shown how summary statistics resulting from the standard marginal GWAS analysis can be used as input to a procedure that tests conditional independence hypotheses while controlling the false discovery rate (FDR). This secondary analysis requires sampling synthetic negative controls (knockoffs) from a distribution determined by the linkage disequilibrium patterns in the genome of the population under study. In prior work, we pre-computed this distribution for European genomes, starting from information derived from the UK Biobank. Thus, researchers working with GWAS in a European population can carry out a knockoff analysis with minimal computational costs, using the distributed routine GhostKnockoffGWAS. Here we introduce and release new software (solveblock) that extends this capability to a much richer collection of studies. Given a set of genotyped samples, or a reference dataset, our pipeline efficiently estimates the high-dimensional correlation matrices that describe dependencies across the genome, making rather common sparsity assumptions. Taking this sample-specific estimate as input, the software identifies groups of genetic variants that are highly correlated and uses them to define an appropriate resolution for conditional independence hypotheses. Finally, we compute the distribution for the exchangeable negative controls necessary to test these hypotheses. The output of solveblock can be passed directly to GhostKnockoffGWAS, allowing users to carry out the complete analysis in a two-step procedure. Simulations based on five UK Biobank sub-populations illustrate the method's FDR control. The analysis of 26 phenotypes of varying polygenicity in British individuals results in approximately 19 additional discoveries compared to standard marginal association testing. Our code, precompiled software, and processed files for these five sub-populations are openly shared.
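To make the FDR-controlling selection step concrete, the following is a minimal sketch of the generic knockoff filter threshold from the knockoffs literature. It is not code from solveblock or GhostKnockoffGWAS: the function names `knockoff_threshold` and `knockoff_select` are hypothetical, and the feature statistics `w` (one per variant or group, with large positive values favoring the real feature over its knockoff) are assumed to have been computed elsewhere.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Return the smallest t among the nonzero |w_j| whose estimated
    false discovery proportion is at most q.

    The estimate is (offset + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}).
    offset=1 is the conservative "knockoff+" variant (an assumption of
    this sketch, not a statement about any particular software).
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false positives
        positives = sum(1 for x in w if x >= t)   # candidate discoveries
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target: select nothing


def knockoff_select(w, q=0.1, offset=1):
    """Indices of features whose statistic reaches the threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    # Hypothetical statistics for 12 features, for illustration only.
    w = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.8, 1.5, 1.2, 1.0, -0.5, 0.3]
    print(knockoff_select(w, q=0.1))
```

With few features the offset makes small selections impossible at strict targets: for example, three statistics can never reach an estimated proportion of 0.1, so nothing is selected. In a group-level analysis, each entry of `w` would correspond to one group of correlated variants rather than a single variant.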