Benner Christian, Havulinna Aki S, Järvelin Marjo-Riitta, Salomaa Veikko, Ripatti Samuli, Pirinen Matti
Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; Department of Public Health, University of Helsinki, 00014 Helsinki, Finland.
Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; National Institute for Health and Welfare, 00271 Helsinki, Finland.
Am J Hum Genet. 2017 Oct 5;101(4):539-551. doi: 10.1016/j.ajhg.2017.08.012. Epub 2017 Sep 21.
During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
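The point about reference panel size can be illustrated with a toy simulation. The following is a minimal sketch and not the authors' software or data. It assumes that LD is measured as the Pearson correlation between genotype dosages, and it draws the GWAS cohort and the reference panels independently from one simulated population. The function names (`simulate_genotypes`, `ld_matrix`, `ld_rmse`) and all parameter values are hypothetical and chosen only for illustration. The sketch shows only that LD estimates from a smaller panel deviate more from the GWAS-sample LD. It does not reproduce the fine-mapping results or the scaling argument of the paper.

```python
import numpy as np


def simulate_genotypes(n, m, rho, rng):
    """Simulate n diploid genotypes (0/1/2) at m variants.

    Toy model (an assumption of this sketch): each haplotype is a
    thresholded Gaussian vector whose correlation decays as rho**|i-j|,
    which gives allele frequency 0.5 and LD between nearby variants.
    """
    idx = np.arange(m)
    cov = rho ** np.abs(np.subtract.outer(idx, idx))
    chol = np.linalg.cholesky(cov)
    z = rng.standard_normal((2 * n, m)) @ chol.T
    haps = (z > 0).astype(np.int8)
    # Pair up haplotypes to form diploid genotype dosages.
    return haps[:n] + haps[n:]


def ld_matrix(genotypes):
    """Estimate LD as the correlation matrix of genotype dosages."""
    return np.corrcoef(genotypes, rowvar=False)


def ld_rmse(ld_a, ld_b):
    """Root-mean-square difference over the off-diagonal LD entries."""
    upper = np.triu_indices(ld_a.shape[0], k=1)
    return float(np.sqrt(np.mean((ld_a[upper] - ld_b[upper]) ** 2)))


rng = np.random.default_rng(0)
n_variants, rho = 20, 0.6

# LD computed from the (simulated) individual-level GWAS data.
gwas = simulate_genotypes(5000, n_variants, rho, rng)
ld_gwas = ld_matrix(gwas)

# LD computed from independent reference panels of different sizes.
errors = {}
for n_ref in (100, 1000):
    panel = simulate_genotypes(n_ref, n_variants, rho, rng)
    errors[n_ref] = ld_rmse(ld_matrix(panel), ld_gwas)

for n_ref, err in errors.items():
    print(f"reference panel n={n_ref}: RMSE vs GWAS LD = {err:.4f}")
```

In this toy setting the smaller panel yields the larger discrepancy, because the sampling error of a correlation estimate shrinks roughly with the square root of the panel size.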