Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA.
Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06510, USA.
Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbaa442.
Genetic correlation is the correlation between the effects of genetic variants across the genome on two phenotypes. It is an informative metric for quantifying the overall genetic similarity between complex traits and provides insights into their polygenic genetic architecture. Several methods have been proposed to estimate genetic correlation from data collected in genome-wide association studies (GWAS). Because GWAS summary statistics are easy to access and computationally efficient to analyze, methods that require only summary statistics as input have become more popular than methods that use individual-level genotype data. Here, we present a benchmark study of summary-statistics-based genetic correlation estimation methods through simulations and real data applications. We focus on two major technical challenges in estimating genetic correlation: marker dependency caused by linkage disequilibrium (LD) and sample overlap between studies. To assess how the methods perform in the presence of these two challenges, we first conducted comprehensive simulations with diverse LD patterns and degrees of sample overlap. We then applied the methods to real GWAS summary statistics for a wide spectrum of complex traits. Based on these experiments, we conclude that methods relying on accurate LD estimation are less robust in real data applications because LD obtained from reference panels is imprecise. Our findings offer guidance on choosing appropriate methods for genetic correlation estimation in post-GWAS analysis.
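As an illustration of the definition given in the abstract, the following minimal sketch computes a correlation between per-variant effects on two traits. It is a toy example under stated assumptions, not any of the benchmarked methods: the function name `genetic_correlation` and the effect-size values are hypothetical, and the true effects are treated as known, which is never the case in practice.

```python
import math


def genetic_correlation(beta1, beta2):
    """Pearson correlation between two equal-length lists of per-variant effects.

    Assumption: beta1[j] and beta2[j] are the true effects of variant j on
    trait 1 and trait 2. Real methods must estimate this quantity from noisy
    marginal GWAS summary statistics instead.
    """
    if len(beta1) != len(beta2) or len(beta1) < 2:
        raise ValueError("need two equal-length effect vectors with at least 2 variants")
    m = len(beta1)
    mean1 = sum(beta1) / m
    mean2 = sum(beta2) / m
    # Sums of cross-products and squared deviations across variants
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(beta1, beta2))
    var1 = sum((a - mean1) ** 2 for a in beta1)
    var2 = sum((b - mean2) ** 2 for b in beta2)
    if var1 == 0 or var2 == 0:
        raise ValueError("effects must vary across variants for both traits")
    return cov / math.sqrt(var1 * var2)


# Hypothetical effect sizes for five variants (illustrative values only)
beta_trait1 = [0.10, -0.05, 0.20, 0.00, -0.15]
beta_trait2 = [0.08, -0.02, 0.15, 0.01, -0.10]

r_g = genetic_correlation(beta_trait1, beta_trait2)
print(f"toy genetic correlation: {r_g:.3f}")
```

The sketch deliberately ignores the two challenges studied here. With real summary statistics, LD makes the marginal effect estimates of nearby markers dependent, and sample overlap between studies induces correlated estimation noise, so a naive correlation of estimated effects would generally be biased.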