Hughey Jacob J, Rhoades Seth D, Fu Darwin Y, Bastarache Lisa, Denny Joshua C, Chen Qingxia
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA.
Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA.
BMC Genomics. 2019 Nov 4;20(1):805. doi: 10.1186/s12864-019-6192-1.
The growth of DNA biobanks linked to data from electronic health records (EHRs) has enabled the discovery of numerous associations between genomic variants and clinical phenotypes. Nonetheless, although clinical data are generally longitudinal, standard approaches for detecting genotype-phenotype associations in such linked data, notably logistic regression, do not naturally account for variation in the period of follow-up or the time at which an event occurs. Here we explored the advantages of quantifying associations using Cox proportional hazards regression, which can account for the age at which a patient first visited the healthcare system (left truncation) and the age at which a patient either last visited the healthcare system or acquired a particular phenotype (right censoring).
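The role of left truncation and right censoring can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `score_and_information` and `fit_log_hazard_ratio` and the eight toy records are hypothetical, invented only for illustration. The sketch fits a single-covariate Cox model by Newton-Raphson on the partial likelihood, using age as the time scale and the Breslow convention for tied event ages. A subject counts as at risk at age t only if their first visit was before t and their last visit or diagnosis was at or after t.

```python
import math

def score_and_information(beta, entry, exit_, event, x):
    """First derivative and negative second derivative of the Cox partial
    log-likelihood for one covariate, with delayed entry (left truncation)."""
    score = 0.0
    information = 0.0
    n = len(x)
    for i in range(n):
        if not event[i]:
            continue  # censored subjects contribute only through risk sets
        t = exit_[i]
        s0 = s1 = s2 = 0.0
        for j in range(n):
            # Risk set at age t: entered before t and still observed at t.
            if entry[j] < t <= exit_[j]:
                w = math.exp(beta * x[j])
                s0 += w
                s1 += w * x[j]
                s2 += w * x[j] * x[j]
        mean = s1 / s0
        score += x[i] - mean
        information += s2 / s0 - mean * mean
    return score, information

def fit_log_hazard_ratio(entry, exit_, event, x, max_iter=50, tol=1e-10):
    """Newton-Raphson estimate of the log hazard ratio."""
    beta = 0.0
    for _ in range(max_iter):
        score, information = score_and_information(beta, entry, exit_, event, x)
        if information == 0.0:
            break  # no risk set contains both covariate values
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical toy records (made up for illustration, not from the study).
entry = [30, 25, 40, 35, 20, 50, 45, 30]  # age at first visit
exit_ = [45, 60, 50, 70, 55, 65, 80, 52]  # age at diagnosis or last visit
event = [1, 1, 1, 0, 1, 1, 0, 0]          # 1 = phenotype acquired, 0 = censored
x = [1, 0, 1, 0, 0, 1, 0, 1]              # 1 = carrier of the variant

beta = fit_log_hazard_ratio(entry, exit_, event, x)
hazard_ratio = math.exp(beta)
```

Setting every entry age to 0 in this sketch would treat each subject as observed from birth, which is the assumption that accounting for left truncation avoids.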
In comprehensive simulations, we found that, compared to logistic regression, Cox regression had greater power at equivalent Type I error. We then scanned for genotype-phenotype associations using logistic regression and Cox regression on 50 phenotypes derived from the EHRs of 49,792 genotyped individuals. Consistent with the findings from our simulations, Cox regression had approximately 10% greater relative sensitivity for detecting known associations from the NHGRI-EBI GWAS Catalog. In terms of effect sizes, the hazard ratios estimated by Cox regression were strongly correlated with the odds ratios estimated by logistic regression.
As longitudinal health-related data continue to grow, Cox regression may improve our ability to identify the genetic basis for a wide range of human phenotypes.