Wang Liwei, Olson Janet E, Bielinski Suzette J, St Sauver Jennifer L, Fu Sunyang, He Huan, Cicek Mine S, Hathcock Matthew A, Cerhan James R, Liu Hongfang
Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States.
Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States.
Front Genet. 2020 Jun 3;11:556. doi: 10.3389/fgene.2020.00556. eCollection 2020.
Electronic health records (EHRs) are widely adopted and have great potential to serve as a rich, integrated source of phenotype information. Computational phenotyping, which automatically extracts phenotypes from EHR data, can accelerate the adoption and use of phenotype-driven efforts to advance scientific discovery and improve healthcare delivery. A number of computational phenotyping algorithms have been published, but data fragmentation, i.e., incomplete data within a single data source, has been raised as an inherent limitation of computational phenotyping. In this study, we investigated the impact of diverse data sources on two published computational phenotyping algorithms, one for rheumatoid arthritis (RA) and one for type 2 diabetes mellitus (T2DM), using Mayo Clinic EHRs and the Rochester Epidemiology Project (REP), which links medical records from multiple health care systems. Results showed that case selection for both RA (less prevalent) and T2DM (more prevalent) was markedly affected by data fragmentation. In Mayo data, the positive predictive value (PPV) was 91.4% for RA and 92.4% for T2DM, and the false-negative rate (FNR) was 26.6% and 14%, respectively. In REP data, the PPV was 97.2% and 98.3%, and the FNR was 5.2% and 3.3%, respectively. Selection of T2DM controls was also biased, with a PPV of 91.2% and an FNR of 1.2% in Mayo data. We further elaborate on the underlying reasons for these performance differences.
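For readers less familiar with the two metrics reported in the abstract, the following is a minimal sketch of how PPV and FNR are conventionally computed from confusion counts. It uses the standard textbook definitions (PPV = TP / (TP + FP), FNR = FN / (FN + TP)); the abstract does not state the exact formulas or reference standard the authors used, so this is an assumption. The class name, function names, and all counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing algorithm output against a reference standard.

    Hypothetical container for illustration; not part of the study.
    """
    true_positives: int   # flagged by the algorithm and truly cases
    false_positives: int  # flagged by the algorithm but not truly cases
    false_negatives: int  # truly cases that the algorithm missed


def positive_predictive_value(counts: ConfusionCounts) -> float:
    """PPV = TP / (TP + FP): share of flagged patients who are true cases."""
    flagged = counts.true_positives + counts.false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when nothing was flagged")
    return counts.true_positives / flagged


def false_negative_rate(counts: ConfusionCounts) -> float:
    """FNR = FN / (FN + TP): share of true cases the algorithm missed."""
    actual_cases = counts.true_positives + counts.false_negatives
    if actual_cases == 0:
        raise ValueError("FNR is undefined when there are no true cases")
    return counts.false_negatives / actual_cases


# Made-up counts, chosen only to make the arithmetic easy to follow.
example = ConfusionCounts(true_positives=90, false_positives=10, false_negatives=30)
ppv = positive_predictive_value(example)
fnr = false_negative_rate(example)
print(f"PPV = {ppv:.1%}, FNR = {fnr:.1%}")
```

Under these definitions, a fragmented data source can raise the FNR when the evidence needed to identify a true case is recorded only in another health care system, which is consistent with the lower FNR the abstract reports for the linked REP data.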