Wang Lu, Damrauer Scott M, Zhang Hong, Zhang Alan X, Xiao Rui, Moore Jason H, Chen Jinbo
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
Division of Vascular Surgery and Endovascular Therapy, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
Genet Epidemiol. 2017 Dec;41(8):790-800. doi: 10.1002/gepi.22080. Epub 2017 Oct 11.
The linkage between electronic health records (EHRs) and genotype data makes it possible to study the genetic susceptibility of a wide range of disease phenotypes. Although EHR-derived phenotype data are subject to misclassification, they have been shown to be useful for discovering susceptibility genes, particularly in the setting of phenome-wide association studies (PheWAS). It is essential to characterize discovered associations using gold-standard phenotype data obtained by chart review. In this work, we propose a genotype-stratified case-control sampling strategy to select subjects for phenotype validation. We develop a closed-form maximum-likelihood estimator for the odds ratio parameters and a score statistic for testing genetic association using the combined validated and error-prone EHR-derived phenotype data, and we assess the extent of power improvement provided by this approach. Compared with case-control sampling based only on EHR-derived phenotype data, our genotype-stratified strategy maintains nominal type I error rates and results in higher power for detecting associations. It also corrects the bias in the odds ratio parameter estimates and reduces the corresponding variance, especially when the minor allele frequency is small.
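The genotype-stratified selection of subjects for chart review can be illustrated with a minimal sketch. The abstract does not give the allocation rule, so the equal allocation per stratum used here is an assumption for illustration only, and the function name `select_validation_sample`, the tuple layout, and the toy cohort are all hypothetical. The sketch groups subjects by genotype (minor-allele count) and EHR-derived case-control status, then draws a random validation subset from each group; it does not implement the estimator or the score test.

```python
import random
from collections import defaultdict


def select_validation_sample(subjects, n_per_stratum, seed=0):
    """Pick subjects for chart review, stratified by genotype and EHR phenotype.

    subjects: iterable of (subject_id, genotype, ehr_phenotype) tuples, where
    genotype is the minor-allele count (0, 1, 2) and ehr_phenotype is 0 or 1.
    Equal allocation per stratum is an illustrative assumption, not the
    allocation used in the paper.
    """
    rng = random.Random(seed)

    # Group subject IDs by (genotype, EHR-derived phenotype) stratum.
    strata = defaultdict(list)
    for subject_id, genotype, ehr_phenotype in subjects:
        strata[(genotype, ehr_phenotype)].append(subject_id)

    # Draw up to n_per_stratum subjects at random from each stratum.
    selected = {}
    for key in sorted(strata):
        ids = strata[key]
        k = min(n_per_stratum, len(ids))
        selected[key] = sorted(rng.sample(ids, k))
    return selected


# Hypothetical toy cohort: rare minor allele, EHR phenotype assigned at random.
toy_rng = random.Random(1)
cohort = [
    (i, toy_rng.choices([0, 1, 2], weights=[0.81, 0.18, 0.01])[0], toy_rng.choice([0, 1]))
    for i in range(2000)
]
sample = select_validation_sample(cohort, n_per_stratum=20)
for stratum, ids in sample.items():
    print(stratum, len(ids))
```

Stratifying on genotype in this way oversamples carriers of the rare allele relative to simple case-control sampling on the EHR phenotype alone, which is the setting in which the abstract reports the largest variance reduction.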