Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.
Am J Hum Genet. 2020 Jan 2;106(1):3-12. doi: 10.1016/j.ajhg.2019.11.012. Epub 2019 Dec 19.
In biobank data analysis, most binary phenotypes have unbalanced case-control ratios, which can inflate type I error rates. Recently, a saddlepoint approximation (SPA)-based single-variant test was developed to provide an accurate and scalable method for testing associations with such phenotypes. For gene- or region-based multiple-variant tests, a few methods exist that can adjust for unbalanced case-control ratios; however, these methods are either less accurate when case-control ratios are extremely unbalanced or not scalable to large data analyses. To address these problems, we propose SKAT- and SKAT-O-type region-based tests in which the single-variant score statistic is calibrated using SPA and efficient resampling (ER). Through simulation studies, we show that the proposed method provides well-calibrated p values. In contrast, when the case-control ratio is 1:99, the unadjusted approach has greatly inflated type I error rates (90 times the exome-wide sequencing level of α = 2.5 × 10⁻⁶). Additionally, the proposed method has computation time similar to that of the unadjusted approaches and is scalable to large-sample data. In our application, the UK Biobank whole-exome sequence data analysis of 45,596 unrelated European samples and 791 PheCode phenotypes identified 10 rare-variant associations with p value < 10⁻⁷, including the associations between JAK2 and myeloproliferative disease, HOXB13 and cancer of prostate, and F11 and congenital coagulation defects. All analysis summary results are publicly available through a web-based visual server, which can help facilitate the identification of the genetic basis of complex diseases.
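To illustrate why unbalanced case-control ratios inflate type I error and how a saddlepoint approximation can recalibrate a single-variant score statistic, here is a minimal sketch. It is not the authors' implementation. It assumes no covariates, so every sample has the same null case probability `mu`, and it covers only the single-variant tail probability, not the SKAT or SKAT-O combination step or efficient resampling. The function names (`score_cgf`, `spa_upper_tail`, `normal_upper_tail`) and the bisection root-finder are illustrative choices.

```python
import math


def normal_upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def score_cgf(t, genotypes, mu):
    """Cumulant generating function of S = sum_i g_i * (y_i - mu),
    with y_i ~ Bernoulli(mu) under the null (assumption: no covariates).
    Returns (K(t), K'(t), K''(t))."""
    k0 = k1 = k2 = 0.0
    for g in genotypes:
        e = math.exp(g * t)
        denom = 1.0 - mu + mu * e
        k0 += math.log(denom) - t * g * mu
        k1 += g * mu * e / denom - g * mu
        k2 += g * g * mu * (1.0 - mu) * e / (denom * denom)
    return k0, k1, k2


def spa_upper_tail(s, genotypes, mu, lo=-30.0, hi=30.0):
    """Saddlepoint (Barndorff-Nielsen type) approximation to P(S > s)."""
    if not (score_cgf(lo, genotypes, mu)[1] < s < score_cgf(hi, genotypes, mu)[1]):
        raise ValueError("s is outside the range bracketed by [lo, hi]")
    # K'(t) is increasing in t, so bisection finds the saddlepoint K'(t) = s.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score_cgf(mid, genotypes, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    if abs(t) < 1e-6:
        # s is essentially at the null mean; the formula is singular there.
        return 0.5
    k0, _, k2 = score_cgf(t, genotypes, mu)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    v = t * math.sqrt(k2)
    return normal_upper_tail(w + math.log(v / w) / w)


def normal_approx_upper_tail(s, genotypes, mu):
    """Unadjusted tail probability using the null variance K''(0)."""
    variance = score_cgf(0.0, genotypes, mu)[2]
    return normal_upper_tail(s / math.sqrt(variance))
```

In a toy setting with 20 carriers among 1,000 samples, a 1% case fraction, and 3 cases among the carriers (`s = 3 - 20 * 0.01`), the normal approximation yields a tail probability several orders of magnitude smaller than the saddlepoint value. That gap is the kind of anti-conservative behavior the abstract attributes to unadjusted tests.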