Impact of Diverse Data Sources on Computational Phenotyping.

Author information

Wang Liwei, Olson Janet E, Bielinski Suzette J, St Sauver Jennifer L, Fu Sunyang, He Huan, Cicek Mine S, Hathcock Matthew A, Cerhan James R, Liu Hongfang

Affiliations

Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States.

Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States.

Publication information

Front Genet. 2020 Jun 3;11:556. doi: 10.3389/fgene.2020.00556. eCollection 2020.

Abstract

Electronic health records (EHRs) are widely adopted and have great potential to serve as a rich, integrated source of phenotype information. Computational phenotyping, which extracts phenotypes from EHR data automatically, can accelerate the adoption and utilization of phenotype-driven efforts to advance scientific discovery and improve healthcare delivery. A number of computational phenotyping algorithms have been published, but data fragmentation, i.e., incomplete data within a single data source, has been raised as an inherent limitation of computational phenotyping. In this study, we investigated the impact of diverse data sources on two published computational phenotyping algorithms, for rheumatoid arthritis (RA) and type 2 diabetes mellitus (T2DM), using Mayo EHRs and the Rochester Epidemiology Project (REP), which links medical records from multiple health care systems. Results showed that case selection for both RA (less prevalent) and T2DM (more prevalent) was markedly impacted by data fragmentation: in Mayo data, the positive predictive value (PPV) was 91.4% and 92.4% and the false-negative rate (FNR) was 26.6% and 14%, respectively, whereas in REP the PPV was 97.2% and 98.3% and the FNR was 5.2% and 3.3%. T2DM controls were also biased, with a PPV of 91.2% and an FNR of 1.2% in Mayo data. We further elaborated on the underlying reasons affecting performance.
