A Fast and Accurate Method for Genome-Wide Time-to-Event Data Analysis and Its Application to UK Biobank.

Author information

Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA.

Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.

Publication information

Am J Hum Genet. 2020 Aug 6;107(2):222-233. doi: 10.1016/j.ajhg.2020.06.003. Epub 2020 Jun 25.

DOI:10.1016/j.ajhg.2020.06.003

PMID:32589924

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7413891/

Abstract

With increasing biobanking efforts connecting electronic health records and national registries to germline genetics, the time-to-event data analysis has attracted increasing attention in the genetics studies of human diseases. In time-to-event data analysis, the Cox proportional hazards (PH) regression model is one of the most used approaches. However, existing methods and tools are not scalable when analyzing a large biobank with hundreds of thousands of samples and endpoints, and they are not accurate when testing low-frequency and rare variants. Here, we propose a scalable and accurate method, SPACox (a saddlepoint approximation implementation based on the Cox PH regression model), that is applicable for genome-wide scale time-to-event data analysis. SPACox requires fitting a Cox PH regression model only once across the genome-wide analysis and then uses a saddlepoint approximation (SPA) to calibrate the test statistics. Simulation studies show that SPACox is 76-252 times faster than other existing alternatives, such as gwasurvivr, 185-511 times faster than the standard Wald test, and more than 6,000 times faster than the Firth correction and can control type I error rates at the genome-wide significance level regardless of minor allele frequencies. Through the analysis of UK Biobank inpatient data of 282,871 white British European ancestry samples, we show that SPACox can efficiently analyze large sample sizes and accurately control type I error rates. We identified 611 loci associated with time-to-event phenotypes of 12 common diseases, of which 38 loci would be missed within a logistic regression framework with a binary phenotype defined as event occurrence status during the follow-up period.
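The abstract's point about rare variants can be illustrated with a generic saddlepoint approximation (SPA) to a tail probability. The sketch below is not the SPACox implementation: it assumes a simple binomial count as a stand-in for a test statistic driven by rare events, uses the continuity-corrected Lugannani-Rice formula for lattice variables, and compares the result with the exact tail and with a normal approximation. All function names are hypothetical.

```python
import math


def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def exact_upper_tail(s, n, p):
    """Exact P(S >= s) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
               for k in range(s, n + 1))


def normal_upper_tail(s, n, p):
    """Normal approximation to P(S >= s), with a continuity correction."""
    sd = math.sqrt(n * p * (1.0 - p))
    return normal_sf((s - 0.5 - n * p) / sd)


def spa_upper_tail(s, n, p):
    """Saddlepoint approximation to P(S >= s) for S ~ Binomial(n, p).

    Assumes n*p < s < n. The cumulant generating function (CGF) is
    K(t) = n * log(1 - p + p * exp(t)).
    """
    # Saddlepoint: solve K'(t) = s, which has a closed form here.
    t_hat = math.log(s * (1.0 - p) / ((n - s) * p))
    denom = 1.0 - p + p * math.exp(t_hat)
    k0 = n * math.log(denom)                                  # K(t_hat)
    k2 = n * p * (1.0 - p) * math.exp(t_hat) / denom**2       # K''(t_hat)
    w = math.copysign(math.sqrt(2.0 * (t_hat * s - k0)), t_hat)
    # Lattice (continuity-corrected) version of the Lugannani-Rice term.
    u = (1.0 - math.exp(-t_hat)) * math.sqrt(k2)
    return normal_sf(w) + normal_pdf(w) * (1.0 / u - 1.0 / w)


if __name__ == "__main__":
    n, p, s = 200, 0.01, 10   # rare events: expected count is only 2
    print("exact :", exact_upper_tail(s, n, p))
    print("SPA   :", spa_upper_tail(s, n, p))
    print("normal:", normal_upper_tail(s, n, p))
```

In this toy setting the normal approximation understates the tail probability by orders of magnitude, while the saddlepoint value stays close to the exact one. This is the kind of calibration problem the abstract describes for low-frequency and rare variants.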
