Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.
Nat Genet. 2020 Jun;52(6):634-639. doi: 10.1038/s41588-020-0621-6. Epub 2020 May 18.
With very large sample sizes, biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, region-based multiple-variant aggregate tests are commonly used to increase power for association tests. However, because of the substantial computational cost, existing region-based tests cannot analyze hundreds of thousands of samples while accounting for confounders such as population stratification and sample relatedness. Here we propose a scalable generalized mixed-model region-based association test, SAIGE-GENE, that is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples and can account for unbalanced case-control ratios for binary traits. Through extensive simulation studies and analysis of the HUNT study with 69,716 Norwegian samples and the UK Biobank data with 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large-sample data (N > 400,000) with type I error rates well controlled.
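To make "region-based multiple-variant aggregate test" concrete, the following is a minimal sketch of the simplest kind of aggregate test, a burden-style score test for a quantitative trait. It is not the SAIGE-GENE method. It deliberately ignores covariates, sample relatedness, and case-control imbalance, which are the problems the abstract says SAIGE-GENE addresses. The function names `burden_scores` and `burden_test`, the default `maf_threshold`, and the toy data are all hypothetical and chosen for illustration only.

```python
import math
import numpy as np


def burden_scores(genotypes, maf_threshold=0.01):
    """Collapse the rare variants of one region into a per-sample burden score.

    genotypes: (n_samples, n_variants) array of allele counts coded 0/1/2.
    Returns the per-sample burden and the number of variants that were kept.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    # Allele frequency of each variant, estimated from the sample itself.
    freq = genotypes.mean(axis=0) / 2.0
    # Keep variants that are observed at least once and are at or below the cutoff.
    rare = (freq > 0) & (freq <= maf_threshold)
    return genotypes[:, rare].sum(axis=1), int(rare.sum())


def burden_test(phenotype, burden):
    """Two-sided score test of association between a trait and a burden score.

    Assumes unrelated samples and no covariates (an illustrative simplification).
    Returns (z, p_value); z is approximately N(0, 1) under the null.
    """
    y = np.asarray(phenotype, dtype=float)
    g = np.asarray(burden, dtype=float)
    y = y - y.mean()
    g = g - g.mean()
    denom = math.sqrt(float(g @ g) * float(y @ y))
    if denom == 0.0:
        # No variation in the burden or in the trait: nothing to test.
        return 0.0, 1.0
    r = float(g @ y) / denom
    z = r * math.sqrt(len(y))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Toy region: 6 samples, 3 variants; the third variant is common.
    toy_genotypes = [
        [1, 0, 2],
        [0, 0, 2],
        [0, 1, 1],
        [0, 0, 2],
        [0, 0, 1],
        [0, 0, 2],
    ]
    toy_trait = [2.0, 0.0, 2.0, 0.0, 0.0, 0.0]
    burden, n_rare = burden_scores(toy_genotypes, maf_threshold=0.1)
    print(n_rare, burden, burden_test(toy_trait, burden))
```

Collapsing variants into one score is what gives aggregate tests their power for rare variants. A sketch like this would not control type I error in a biobank with related samples or unbalanced case-control ratios, which is the gap the abstract describes.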