Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA.
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
Am J Hum Genet. 2020 Aug 6;107(2):222-233. doi: 10.1016/j.ajhg.2020.06.003. Epub 2020 Jun 25.
As biobanking efforts increasingly link electronic health records and national registries to germline genetics, time-to-event data analysis has attracted growing attention in genetic studies of human diseases. In time-to-event data analysis, the Cox proportional hazards (PH) regression model is one of the most widely used approaches. However, existing methods and tools do not scale to large biobanks with hundreds of thousands of samples and endpoints, and they are not accurate when testing low-frequency and rare variants. Here, we propose SPACox (a saddlepoint approximation implementation based on the Cox PH regression model), a scalable and accurate method for genome-wide time-to-event data analysis. SPACox fits a Cox PH regression model only once for the entire genome-wide analysis and then uses a saddlepoint approximation (SPA) to calibrate the test statistics. Simulation studies show that SPACox is 76-252 times faster than existing alternatives such as gwasurvivr, 185-511 times faster than the standard Wald test, and more than 6,000 times faster than the Firth correction, and that it controls type I error rates at the genome-wide significance level regardless of minor allele frequency. Through the analysis of UK Biobank inpatient data from 282,871 samples of white British European ancestry, we show that SPACox can efficiently analyze large sample sizes and accurately control type I error rates. We identified 611 loci associated with time-to-event phenotypes of 12 common diseases, of which 38 loci would be missed within a logistic regression framework with a binary phenotype defined as event occurrence status during the follow-up period.
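The calibration step can be illustrated with a minimal sketch of a saddlepoint-approximated tail probability for a score-type statistic. This is not the SPACox implementation. It assumes, purely for illustration, that the statistic is S = sum_i r_i (G_i - 2f), with fixed null-model residuals r_i and independent genotypes G_i ~ Binomial(2, f). The residual vector, the allele frequency, the observed value, and all function names below are hypothetical. The tail formula used is Barndorff-Nielsen's form of the Lugannani-Rice approximation.

```python
import math


def cgf(t, resid, maf):
    """Return K(t), K'(t), K''(t) for S = sum_i r_i * (G_i - 2*maf).

    Assumes G_i ~ Binomial(2, maf), independent across samples (illustrative model).
    """
    k0 = k1 = k2 = 0.0
    for r in resid:
        e = math.exp(t * r)
        denom = 1.0 - maf + maf * e
        p = maf * e / denom  # allele probability under exponential tilting
        k0 += 2.0 * math.log(denom) - 2.0 * maf * r * t
        k1 += 2.0 * r * (p - maf)
        k2 += 2.0 * r * r * p * (1.0 - p)
    return k0, k1, k2


def solve_saddlepoint(s, resid, maf, tol=1e-10, t_max=64.0):
    """Solve K'(t) = s by bisection; K' is increasing because K is convex."""
    lo, hi = -1.0, 1.0
    while cgf(lo, resid, maf)[1] > s:
        lo *= 2.0
        if lo < -t_max:
            raise ValueError("s is below the range this sketch can handle")
    while cgf(hi, resid, maf)[1] < s:
        hi *= 2.0
        if hi > t_max:
            raise ValueError("s is above the range this sketch can handle")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cgf(mid, resid, maf)[1] < s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def normal_upper_tail(s, resid, maf):
    """Normal approximation to P(S >= s), using the null variance K''(0)."""
    var = cgf(0.0, resid, maf)[2]
    return 0.5 * math.erfc(s / math.sqrt(2.0 * var))


def spa_upper_tail(s, resid, maf):
    """Saddlepoint approximation to P(S >= s)."""
    var = cgf(0.0, resid, maf)[2]
    if abs(s) < 1e-3 * math.sqrt(var):
        # The formula is singular at the mean; use the normal approximation there.
        return normal_upper_tail(s, resid, maf)
    t = solve_saddlepoint(s, resid, maf)
    k0, _, k2 = cgf(t, resid, maf)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    v = t * math.sqrt(k2)
    z = w + math.log(v / w) / w
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical residuals mimicking a rare event: 20 samples with residual 0.99
# and 1,980 samples with residual -0.01, so the residuals sum to zero.
resid = [0.99] * 20 + [-0.01] * 1980
maf = 0.01       # hypothetical minor allele frequency
s_obs = 3.0      # hypothetical observed score statistic

p_norm = normal_upper_tail(s_obs, resid, maf)
p_spa = spa_upper_tail(s_obs, resid, maf)
print(f"normal approximation: {p_norm:.3g}")
print(f"saddlepoint approximation: {p_spa:.3g}")
```

With a rare variant and highly unbalanced residuals, the distribution of the score statistic in this toy setting is strongly right-skewed, so the normal approximation gives a far smaller tail probability than the saddlepoint approximation. Because the residuals depend only on the null model, they are computed once and reused for every variant; only the per-variant tail calculation is repeated.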